# Data Visualization with Python

## Introduction

Data visualization is part of data exploration, which is a critical step in the AI cycle. You will use this technique to gain understanding of and insights into the data you have gathered, and to determine whether the data is ready for further processing or whether you need to collect more data or clean it.

You will also use this technique to present your results. 

In this notebook, we will focus on data visualization using the matplotlib library, a popular data visualization library in Python. 

## Context

We will be working with Jaipur weather data obtained from [Kaggle](https://www.kaggle.com/rajatdey/jaipur-weather-forecasting/version/2), a platform for data enthusiasts to gather, share knowledge and compete for many prizes!

The data has been cleaned and simplified, so that we can focus on data visualization instead of data cleaning. 

## Data in agriculture

Imagine that you are a farmer. What would be your main concerns? You want to have the best conditions for your crops to grow and bear good yield. 

The weather is a crucial factor in the success of your farm (unless you are running an [indoor farm](https://www.nanalyze.com/2019/01/indoor-farming-agriculture-indoors/)). Thankfully, weather data is now easier to find thanks to the prevalence of sensor technology. There are also numerous open data sources from which you can collect weather data.

- [India open data source](https://data.gov.in/keywords/weather)
- [Korea open data source](https://www.data.go.kr/e_main.jsp#/L21haW4=)
- [Poland open data source](https://index.okfn.org/place/pl/)

Alright, you have obtained your data (JaipurFinalCleanData.csv). This file contains weather information of Jaipur.

What do you do next?

### Side note: What is csv?

A CSV (Comma-Separated Values) file contains a set of data in which the values are separated by [commas](http://w3c.github.io/csvw/use-cases-and-requirements/egypt-referendum-2012-result-csv-in-textwrangler.png).
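
To make this concrete, here is a minimal sketch of what CSV text looks like and how it splits into rows and fields. The column names and values are made up for illustration and are not taken from the Jaipur file.

```python
import csv
import io

# A tiny, made-up CSV document: the first line is the header,
# and every following line is one record (row).
csv_text = """date,mean_temperature,max_temperature
2016-05-01,34,41
2016-05-02,35,42
"""

# csv.reader splits each line into a list of strings at the commas
rows = list(csv.reader(io.StringIO(csv_text)))

header = rows[0]
records = rows[1:]

print(header)
print(records)
```

Note that every field comes back as text here. Turning the numbers into numeric values is one of the things a library such as pandas does for you.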

We usually open these files with spreadsheet applications such as Excel or Google Sheets. Do you know how this is done?

Today, we will learn how to use Python to open csv files. 

## Use Python to open csv files

We will use the [pandas](https://pandas.pydata.org/) library to work with our dataset. Pandas is a popular Python library for data science. It offers powerful and flexible data structures that make data manipulation and analysis easier.

## Import Pandas


In [None]:
import pandas as pd #import pandas as pd means we can type "pd" to call the pandas library

Now that we have imported pandas, let's start by reading the csv file.

## Mounting Google Drive and importing data

To access files on Colab, we must first mount Google Drive using the code below and then give the correct path to the file in the drive.

In [None]:
# After running this cell, sign in with the Google account to which you uploaded "JaipurFinalCleanData.csv".
# Once done, copy the authorisation code and paste it in the box below.
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Read the csv file into a variable, which we will call dataframe
# In case of an error, check whether the file path is correct
dataframe = pd.read_csv('/content/drive/My Drive/JaipurFinalCleanData.csv')

## Exploring our data

Great! We now have a variable that contains our weather data. Let's explore our data. Use the .head() method to see the first few rows of data.

In [None]:
#dataframe.head() means we are getting the first 5 rows of data
# try running it to see what data is in the jaipur csv file
print (dataframe.head())

In [None]:
dataframe.dtypes
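
`dataframe.dtypes` lists the data type of each column. The sketch below shows this, together with two other quick checks, on a tiny hand-made DataFrame. The values are invented; only the column names `date` and `mean_temperature` come from this notebook.

```python
import pandas as pd

# Hypothetical sample data (not the real Jaipur values)
sample = pd.DataFrame({
    "date": ["2016-05-01", "2016-05-02", "2016-05-03"],
    "mean_temperature": [34, 35, 33],
})

print(sample.shape)       # (number of rows, number of columns)
print(sample.dtypes)      # data type of each column
print(sample.describe())  # summary statistics for the numeric columns
```

Notice that the dates are stored as text rather than as real dates. `pd.read_csv` behaves the same way unless you ask it to parse dates, which matters later when dates are placed on the x axis of a plot.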

## Importing matplotlib

[Matplotlib](https://matplotlib.org/) is a Python 2D plotting library that we can use to produce high-quality data visualizations. It is easy to use (as you will soon find out): you can create simple and complex graphs with just a few lines of code!

Now let's load matplotlib to start plotting some graphs.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

## Scatter plot

Scatter plots use a collection of points on a graph to display values from two variables. This allows us to see if there is any relationship or correlation between the two variables.

Let's see how mean temperature changes over the years! 

In [None]:
x = dataframe.date
y = dataframe.mean_temperature

In [None]:
plt.scatter(x,y)
plt.show()

Do you see that the x axis is filled with a thick line, and that there's no tick label available? This makes us unable to analyze the data. 

Let's try to modify this scatter plot so that we can see the ticks!

### Choose only several ticks
One reason why there's a thick bar below the x axis is that there are numerous labels (2 years of daily data) on the x axis.

The first thing we are going to do is reduce the number of ticks on the x axis. We do this using the np.arange function as below:

In [None]:
plt.scatter(x,y)
plt.xticks(np.arange(0, 731, 30))
plt.show()

See, the numbers are now easier to read, but they still overlap.

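
`np.arange` returns evenly spaced values, and `plt.xticks` places one tick at each of them. Here is a quick sketch of what the call above produces, and of how a larger step gives fewer ticks:

```python
import numpy as np

# np.arange(start, stop, step) counts from start up to, but not including, stop
ticks_30 = np.arange(0, 731, 30)
ticks_60 = np.arange(0, 731, 60)

print(ticks_30)
print(len(ticks_30), "ticks with a step of 30")
print(len(ticks_60), "ticks with a step of 60")
```
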
### Task 8: Change x ticks interval so that you can see the dates clearly

In [None]:
#yourcodehere

What interval did you use so that you can see all the dates? Do you notice that we now have only a few ticks?

Let's try to rotate our ticks. See the example on [Stackoverflow](https://stackoverflow.com/questions/10998621/rotate-axis-text-in-python-matplotlib)!

Note: Stackoverflow is a site where technical people gather and share their knowledge. You can search the site for your question and see if others have already solved it!

### Task 9: Rotate the x tick labels so that we can see more ticks clearly

In [None]:
#yourcodehere

Now we can see the x-ticks clearly. 

Notice how temperature changes according to the time of the year. Compare it with this [website](https://weather-and-climate.com/average-monthly-Rainfall-Temperature-Sunshine,jaipur,India). Does it inform you when to best plant your crop?

### Giving labels to the x and y axes

You can also give labels to the x and y axes. This will make it easier for you to visualise and share your data.

In [None]:
plt.scatter(x,y)
plt.xticks(np.arange(0, 731, 60))
plt.xticks (rotation=30)

# Add x and y labels and set a font size
plt.xlabel ("Date", fontsize = 14)
plt.ylabel ("Mean Temperature", fontsize = 14)

plt.show()

Looks good! 

### Task 10: Now, let's add a title. 
See how to do it [here](https://python-graph-gallery.com/4-add-title-and-axis-label/). 

In [None]:
#yourcodehere

### Task 11: Change the title size to be bigger than the x and y labels!

In [None]:
#yourcodehere

Good! Now, we can also change the size of the plot 

In [None]:
# Change the default figure size
plt.figure(figsize=(10,10))

plt.scatter(x,y)
plt.xticks(np.arange(0, 731, 60))
plt.xticks (rotation=30)

# Add x and y labels, title and set a font size
plt.xlabel ("Date", fontsize = 24)
plt.ylabel ("Mean Temperature", fontsize = 24)
plt.title('Mean Temperature at Jaipur', fontsize = 30)

# Set the font size of the number labels on the axes
plt.xticks (fontsize = 12)
plt.yticks (fontsize = 12)

plt.xticks (rotation=30, horizontalalignment='right')


plt.show()

Looking good! Now, let's customize our graphs with the shapes and colours that we like. See [here](https://python-graph-gallery.com/131-custom-a-matplotlib-scatterplot/) for examples

### Task 12: Change your marker shape!

In [None]:
#yourcodehere

### Changing color

You can also change the marker color. Check out the code below, which shows you how to do it!

In [None]:
# Change the default figure size
plt.figure(figsize=(10,10))

plt.scatter(x,y, c='green', marker='*')
plt.xticks(np.arange(0, 731, 60))
plt.xticks (rotation=30)
    
# Add x and y labels, title and set a font size
plt.xlabel ("Date", fontsize = 24)
plt.ylabel ("Mean Temperature", fontsize = 24)
plt.title('Mean Temperature at Jaipur', fontsize = 30)

# Set the font size of the number labels on the axes
plt.xticks (fontsize = 12)
plt.yticks (fontsize = 12)

plt.xticks (rotation=30, horizontalalignment='right')


plt.show()

Note that this cell only displays the plot. An image file appears in your working directory only after you save the figure, which is covered in the Saving plot section.

### Task 13: Change the data points to your favourite colour. 
Change the font colour and size too!

In [None]:
#yourcodehere

If you are wondering how to get the nice bubble-looking scatter plot in the slides, here is some sample code. Try running it!

In [None]:
import numpy as np

colors = np.random.rand(len(y))  # one random value per data point, used to pick its colour
plt.scatter(x,y,c=colors,alpha=0.5)  # alpha sets the transparency of the markers
plt.show()
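
Random colours look nice but carry no information. As a variation, the sketch below colours each point by its own value and adds a colour bar as a legend. The data is made up to stand in for the temperature column, and `coolwarm` is just one of matplotlib's built-in colour maps.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up temperatures standing in for dataframe.mean_temperature
rng = np.random.default_rng(0)
days = np.arange(100)
temps = 25 + 10 * np.sin(days / 100 * 2 * np.pi) + rng.normal(0, 1, 100)

# c=temps colours each point by its own value; cmap picks the colour scale
points = plt.scatter(days, temps, c=temps, cmap="coolwarm", alpha=0.5)
plt.colorbar(points, label="Mean Temperature")
plt.show()
```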

### Task 14: Try changing the alpha value and see what happens!

### Saving plot

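You can use plt.savefig("figurename.png") to save the current figure. Call it in the same cell that draws the plot, because in a notebook a new cell normally starts with an empty figure. Try it!

In [None]:
plt.scatter(x,y)  # draw the plot again so that there is something to save
plt.savefig("jaipur_scatter_plot.png")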
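
Here is a slightly fuller sketch of saving a figure and checking that the file was written. The points, the file name and the temporary folder are placeholders chosen for illustration. `dpi` and `bbox_inches` are optional arguments that control the resolution and trim the surrounding white space.

```python
import os
import tempfile

import matplotlib.pyplot as plt

# Made-up points, used only to have something to draw
plt.scatter([1, 2, 3], [30, 32, 35])
plt.title("Example plot")

# Save before calling plt.show(), while the figure is still the current one
out_path = os.path.join(tempfile.gettempdir(), "example_scatter_plot.png")
plt.savefig(out_path, dpi=150, bbox_inches="tight")
plt.show()

print("File created:", os.path.exists(out_path))
```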

## Line Plots

Besides a scatter plot, data that changes over time, like the data above, can also be shown with a line plot. Let's see how this is done!

In [None]:
y = dataframe.mean_temperature

plt.plot(y)
plt.ylabel("Mean Temperature")
plt.xlabel("Time")

# One label per tick: there are 12 labels, so we need 12 tick positions (0, 60, ..., 660)
x_tick_labels = ['May16','Jul16','Sept16','Nov16','Jan17','Mar17','May17','Jul17','Sept17','Nov17','Jan18','Mar18']

plt.xticks(np.arange(0, 720, 60), x_tick_labels, rotation=30)

plt.show()
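
Typing the tick labels by hand is error-prone. If the x values are real dates, matplotlib can place and format the ticks for you. The sketch below uses made-up daily data, and the start date is an assumption for illustration. For a column of date strings, `pd.to_datetime` is the usual way to convert it first, provided the strings are in a format pandas can parse.

```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up daily data; the start date is an assumption for illustration
dates = pd.date_range("2016-05-01", periods=731, freq="D")
temps = 25 + 10 * np.sin(np.arange(731) / 365 * 2 * np.pi)

fig, ax = plt.subplots()
ax.plot(dates, temps)

# One tick every 2 months, formatted like "May16"
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=2))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%b%y"))
ax.tick_params(axis="x", rotation=30)

ax.set_xlabel("Time")
ax.set_ylabel("Mean Temperature")
plt.show()
```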

### Task 15: Change the labels and add title so that it is clearer and easier for you to show this graph to others

### Drawing multiple lines in a plot

In [None]:
x = dataframe.date
y_1 = dataframe.max_temperature
y_2 = dataframe.min_temperature

plt.plot(x,y_1, label = "Max temp") 
plt.plot(x,y_2, label = "Min temp") 

plt.xticks(np.arange(0, 731, 60))
plt.xticks (rotation=30)

plt.legend()
plt.show()

### Task 16: Draw at least 3 line graphs in one plot!

In [None]:
#yourcodehere

## Histograms

The histogram is useful for looking at the density of values. For example, you might want to know how many days are hotter than 35°C so that you can see what types of plants would survive better in your climate zone. The histogram allows us to see how the values of a variable are distributed.

Let's look at how histograms are plotted.

In [None]:
y = dataframe.mean_temperature

plt.hist(y,bins=10)

plt.show()

What does the plot above mean?

Let's label the graph clearly as follows:

- Title: Distribution of mean temperature over 2 years (2016 - 2018) in Jaipur
- Y-axis: No. of days
- X-axis: Temperature

In [None]:
y = dataframe.mean_temperature

plt.hist(y,bins=10)

plt.ylabel("No. of days")
plt.xlabel("Temperature")
plt.title('Distribution of mean temperature over 2 years (2016 - 2018) in Jaipur')

plt.show()

What is the mode of this dataset? Which temperature range is represented the most, and which the least?
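
To read exact numbers rather than estimate them from the bars, you can compute the same counts with `np.histogram`, which is what `plt.hist` uses to bin the data. This is a minimal sketch on made-up temperatures, so the printed values say nothing about Jaipur.

```python
import numpy as np

# Made-up temperatures standing in for dataframe.mean_temperature
rng = np.random.default_rng(1)
temps = rng.normal(28, 6, 731)

# counts[i] is the number of values between edges[i] and edges[i + 1]
counts, edges = np.histogram(temps, bins=10)

# The tallest bar is the bin with the largest count
tallest = np.argmax(counts)
print("Most common range:", edges[tallest], "to", edges[tallest + 1])

# Counting the days above a threshold does not need a histogram at all
print("Days hotter than 35:", (temps > 35).sum())
```

With 10 bins there are 11 edges, because each bin needs a left and a right boundary.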

### Task 17: What do you think bins are? Try changing the number of bins to 20. What do you notice?

In [None]:
#yourcodehere

What does the histogram tell you about the temperature in Jaipur over the last two years?

## Bar Charts

A bar chart looks like a histogram, but they are not the same! See the difference between bar charts and histograms [here](https://www.edrawsoft.com/histogram-vs-bar-chart.php).

Now, look at this example of a bar chart made with matplotlib. Here's the [link](https://pythonspot.com/matplotlib-bar-chart/)!

In [None]:
import matplotlib.pyplot as plt
import numpy as np
 
objects = ('Python', 'C++', 'Java', 'Perl', 'Scala', 'Lisp')
y_pos = np.arange(len(objects))
usage = [10,8,6,4,2,1]
 
plt.bar(y_pos, usage, align='center')
plt.xticks(y_pos, objects) # name the xticks
 
plt.show()

Because the Jaipur weather data does not contain categories, we will not use it to make a bar chart. However, do remember how to create one!
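
If you do want categories from weather data, one option is to group the days by month and plot the average of each month. The sketch below assumes a made-up year of daily data; the dates and values are invented, and only the column names follow this notebook.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up daily data; the dates and values are assumptions for illustration
dates = pd.date_range("2017-01-01", periods=365, freq="D")
temps = 25 + 10 * np.sin((np.arange(365) - 100) / 365 * 2 * np.pi)
sample = pd.DataFrame({"date": dates, "mean_temperature": temps})

# Group the days by calendar month (1 to 12) and average each group
monthly_mean = sample.groupby(sample["date"].dt.month)["mean_temperature"].mean()

plt.bar(monthly_mean.index, monthly_mean.values)
plt.xticks(monthly_mean.index)
plt.xlabel("Month")
plt.ylabel("Mean Temperature")
plt.show()
```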

## Boxplots

Boxplots are used to show the distribution of a dataset.

We will explore boxplots using a sample tutorial from the matplotlib [website](https://matplotlib.org/gallery/pyplots/boxplot_demo_pyplot.html#sphx-glr-gallery-pyplots-boxplot-demo-pyplot-py).

### First, we will create random data for our example

In [None]:
np.random.seed(10)
data = np.random.normal(100, 10, 200)

### Next, we will draw our boxplot

In [None]:
fig1, ax1 = plt.subplots()
ax1.set_title('Basic Plot')
ax1.boxplot(data)
plt.show()

Great! You now have your boxplot. Do you remember how to read it? See [here](http://www.physics.csbsju.edu/stats/box2.html).

After reading it, can you identify the following values in the plot?
- Outlier
- Median
- First quartile
- Third quartile
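
One way to check your reading of the plot is to compute these values directly. The sketch below regenerates the same random data as the example and assumes matplotlib's default rule, which draws points lying more than 1.5 times the interquartile range beyond the box as outliers.

```python
import numpy as np

# Same random data as the boxplot example
np.random.seed(10)
data = np.random.normal(100, 10, 200)

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1  # interquartile range: the height of the box

# Points beyond these limits are drawn as individual outlier markers
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = data[(data < lower_limit) | (data > upper_limit)]

print("First quartile:", q1)
print("Median:", median)
print("Third quartile:", q3)
print("Outliers:", outliers)
```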

Now that you understand how a boxplot is drawn, apply it to the temperature data of Jaipur.

In [None]:
y=dataframe.mean_temperature

plt.boxplot(y)
plt.show()

What does the boxplot tell you about the temperature of Jaipur over the past two years?

## Subplots

Often, you will want to plot more than one graph side by side. You can use subplots to do that!

Here's how you can make them!

In [None]:
x = dataframe.date
y = dataframe.mean_temperature

fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True) #create 2 subplots with shared y axis
fig.suptitle('Sharing Y axis')
ax1.plot(x, y)
ax2.scatter(x, y)
plt.show()
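
Subplots can also be arranged in a grid. The sketch below, on made-up data, puts four different plot types in a 2 by 2 grid. The names `sample_x` and `sample_y` are placeholders so that the notebook's own `x` and `y` are left untouched.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up data standing in for the date and temperature columns
sample_x = np.arange(50)
sample_y = 25 + 5 * np.sin(sample_x / 8)

# plt.subplots(rows, columns) returns the figure and an array of axes
fig, axes = plt.subplots(2, 2, figsize=(8, 6))

axes[0, 0].plot(sample_x, sample_y)
axes[0, 0].set_title("Line")
axes[0, 1].scatter(sample_x, sample_y)
axes[0, 1].set_title("Scatter")
axes[1, 0].hist(sample_y, bins=10)
axes[1, 0].set_title("Histogram")
axes[1, 1].boxplot(sample_y)
axes[1, 1].set_title("Boxplot")

fig.tight_layout()  # add spacing so titles and labels do not overlap
plt.show()
```

With more than one row and more than one column, `axes` is a 2D array, so each plot is addressed as `axes[row, column]`.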

### Task 18: Create subplots with 3 plots where the y axis is shared!

In [None]:
#yourcodehere

### Task 19: Create subplots where the x axis is shared!
Check out this [link](https://matplotlib.org/gallery/subplots_axes_and_figures/subplots_demo.html) to see how this is done!

In [None]:
#yourcodehere

# Challenge Questions:

Load the nz_weather.csv dataset from your work folder. 

Source of data: https://raw.githubusercontent.com/plotly/datasets/master/nz_weather.csv.

## Load the data

In [None]:
#yourcodehere

## Print the top 5 entries

In [None]:
#yourcodehere

## Draw 2 line graphs in one plot, showing Christchurch's and Hamilton's weather throughout the year

In [None]:
#yourcodehere

From the plot, can you see which city has the higher temperature? You can calculate the [mean value](https://www.geeksforgeeks.org/python-pandas-dataframe-mean/) to find out.

## Create a histogram for Auckland weather

In [None]:
#yourcodehere

Look at the graph. Do you notice a data point that might be an outlier?

## Create a boxplot for Auckland weather

In [None]:
#yourcodehere

## Create a subplot showing all boxplots of the various areas

In [None]:
#yourcodehere

Do you see the outlier? What might we do with this outlier?

### Great! You have now gained the ability to visualize data using matplotlib. You'll use this skill again throughout the experience stage!