## Data Analysis with pandas

We are going to examine data from the [Bangalore Open Data Repository](https://github.com/openbangalore/bangalore). Bangalore is the third most populous city in India and is widely regarded as the Silicon Valley of India.

### Task 1 - Load Data

- Read in the data to pandas
- Check the shape of the dataframe
- Check the summary statistics of the dataframe
- Change the row index to the "Year" column

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('bangalore_temparature.tsv', sep='\t')

In [3]:
data.head(2)

Unnamed: 0,Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
0,1901,23.094,24.243,25.398,27.74,26.328,24.735,23.969,24.154,25.097,24.301,23.09,21.0
1,1902,21.588,22.93,26.277,27.442,27.118,25.477,24.428,25.019,23.94,23.759,22.79,22.184


In [4]:
data.shape

(102, 13)

In [5]:
data.describe()

Unnamed: 0,Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
count,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0
mean,1951.5,22.122088,24.028069,26.375049,27.844608,27.108118,24.923186,23.985824,24.125373,24.347304,24.100167,22.882186,21.759961
std,29.588849,0.61782,0.732336,0.656179,0.55863,0.636558,0.547904,0.485182,0.373501,0.520555,0.458388,0.581238,0.554744
min,1901.0,20.699,22.145,24.791,26.725,25.378,23.621,22.77,23.09,23.189,22.838,21.693,20.648
25%,1926.25,21.76825,23.50225,25.965,27.452,26.6915,24.601,23.62,23.856,23.9775,23.827,22.43825,21.313
50%,1951.5,22.187,24.074,26.4625,27.899,27.231,24.9325,24.0225,24.0805,24.351,24.1495,22.9345,21.8265
75%,1976.75,22.47675,24.534,26.7485,28.21375,27.57925,25.25075,24.2935,24.43025,24.6855,24.354,23.234,22.11425
max,2002.0,23.53,26.134,28.048,29.068,28.272,26.427,25.19,25.019,25.869,25.413,24.478,23.124


In [6]:
data.index = data.Year

In [7]:
data.head(2)

Year (index),Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
1901,1901,23.094,24.243,25.398,27.74,26.328,24.735,23.969,24.154,25.097,24.301,23.09,21.0
1902,1902,21.588,22.93,26.277,27.442,27.118,25.477,24.428,25.019,23.94,23.759,22.79,22.184
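
Assigning `data.Year` to the index, as in cell 6, keeps `Year` as a regular column too, which is why it appears twice in the output. One possible alternative is `set_index`, sketched below on a small stand-in frame built from the first three months of the two rows shown by `data.head(2)`. The name `sample` is ours and is used only for illustration.

```python
import pandas as pd

# Stand-in for the real file: the first three months of the two rows
# printed by data.head(2) above.
sample = pd.DataFrame({
    "Year": [1901, 1902],
    "Jan": [23.094, 21.588],
    "Feb": [24.243, 22.930],
    "Mar": [25.398, 26.277],
})

# set_index returns a new frame indexed by Year. The column is dropped
# by default (drop=True), so Year no longer appears twice.
indexed = sample.set_index("Year")

print(indexed.columns.tolist())  # month columns only
print(indexed.loc[1901, "Jan"])  # label-based lookup by year
```

On the full frame the equivalent call would be `data = data.set_index("Year")`. The later tasks work with either approach as long as only the month columns are selected.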


### Task 2 - Initial Data Analysis

- (Chart) How does temperature vary over the year (X-axis is months)?
- (Chart) How does temperature vary over the years (X-axis is years)?
- Which months had the highest and lowest temperatures in 1960?
- What were the highest, lowest and mean values in 1960?
- What were the highest, lowest and mean values in an arbitrary year (hint: write a function)?
- Which months had the highest and lowest gains in temperature?
- Make a histogram, KDE plot and bar plot of the gains.
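
The per-year questions above could be approached with a small helper like the one below. This is a minimal sketch, not a reference solution: `summarize_year` and `monthly_gains` are hypothetical names, the frame `sample` holds only the two rows shown by `data.head(2)`, and "gain" is assumed to mean the last year's temperature minus the first year's for each month (the task does not define it).

```python
import pandas as pd

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# The two rows printed by data.head(2), indexed by year.
sample = pd.DataFrame(
    [[23.094, 24.243, 25.398, 27.74, 26.328, 24.735,
      23.969, 24.154, 25.097, 24.301, 23.09, 21.0],
     [21.588, 22.93, 26.277, 27.442, 27.118, 25.477,
      24.428, 25.019, 23.94, 23.759, 22.79, 22.184]],
    index=pd.Index([1901, 1902], name="Year"),
    columns=MONTHS,
)


def summarize_year(df, year):
    """Return the extreme months and the max, min and mean for one year."""
    row = df.loc[year, MONTHS]  # select month columns only, skipping Year
    return {
        "warmest_month": row.idxmax(),
        "coldest_month": row.idxmin(),
        "max": row.max(),
        "min": row.min(),
        "mean": row.mean(),
    }


def monthly_gains(df):
    """Assumed definition of gain: last row minus first row, per month."""
    return df[MONTHS].iloc[-1] - df[MONTHS].iloc[0]


summary = summarize_year(sample, 1901)
gains = monthly_gains(sample)
print(summary)
print(gains.idxmax(), gains.idxmin())
```

On the full frame, `summarize_year(data, 1960)` would answer the 1960 questions, and `monthly_gains(data).plot(kind="bar")` is one way to start on the bar plot of the gains.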

### Task 3 - Visualization

matplotlib
- Make a histogram for a particular month.
- Compare distributions for 2 months (histograms on subplots).
- Compare two years in the same histogram (use color coding).
- Define a function to compare two months in the same frame.
- Create a box plot for 1 month.
- Create a box plot for 2 months on 1 figure.
- Create a function that compares 2 months via boxplot.
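
One way to approach the comparison functions in the matplotlib list is sketched below. To keep the block self-contained it uses a synthetic stand-in frame, not the real data: the values are random draws that roughly follow the January and May mean and standard deviation reported by `data.describe()`. The names `synthetic` and `compare_months` are ours, and the axis label omits a unit because the source does not state one.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in (NOT the real data): 102 years, 1901 to 2002.
rng = np.random.default_rng(0)
years = pd.Index(range(1901, 2003), name="Year")
synthetic = pd.DataFrame(
    {"Jan": rng.normal(22.12, 0.62, len(years)),
     "May": rng.normal(27.11, 0.64, len(years))},
    index=years,
)


def compare_months(df, month_a, month_b, bins=10):
    """Overlaid histograms on the left, side-by-side box plots on the right."""
    a = df[month_a].to_numpy()
    b = df[month_b].to_numpy()
    fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

    # Semi-transparent bars keep both distributions visible where they overlap.
    ax_hist.hist(a, bins=bins, alpha=0.5, label=month_a)
    ax_hist.hist(b, bins=bins, alpha=0.5, label=month_b)
    ax_hist.set_xlabel("Temperature")
    ax_hist.set_ylabel("Number of years")
    ax_hist.legend()

    # boxplot places the boxes at x positions 1 and 2 by default.
    ax_box.boxplot([a, b])
    ax_box.set_xticks([1, 2])
    ax_box.set_xticklabels([month_a, month_b])
    ax_box.set_ylabel("Temperature")
    return fig


fig = compare_months(synthetic, "Jan", "May")
```

With the real frame loaded, the call would be `compare_months(data, "Jan", "May")`.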

seaborn
- Make a KDE plot of one month.
- Make a function that compares the KDE plots of 2 months.
- Compare the histogram and KDE plots of 2 months on the same figure.
- Create a FacetGrid version of the KDE plots. Loop through a months array.
- Create a violin plot for 1 month.
- Create a violin plot for 2 months on 1 figure.
- Create a function that compares 2 months via violin plot.

### Task 4 - Insight

- Visually represent the varying temperatures over the entire period.
- Find the coldest and warmest years for May over the entire period.
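
One possible starting point for this task is a year-by-month heatmap plus an `idxmin`/`idxmax` lookup, which returns the years in which a given month was coldest and warmest. The sketch below is an assumption about how to read the task, not a reference solution. It runs on the two rows shown by `data.head(2)`, and `plot_heatmap` and `extreme_years` are hypothetical names.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# The two rows printed by data.head(2), indexed by year.
sample = pd.DataFrame(
    [[23.094, 24.243, 25.398, 27.74, 26.328, 24.735,
      23.969, 24.154, 25.097, 24.301, 23.09, 21.0],
     [21.588, 22.93, 26.277, 27.442, 27.118, 25.477,
      24.428, 25.019, 23.94, 23.759, 22.79, 22.184]],
    index=pd.Index([1901, 1902], name="Year"),
    columns=MONTHS,
)


def plot_heatmap(df):
    """Show every year (rows) and month (columns) as one colored grid."""
    fig, ax = plt.subplots(figsize=(6, 8))
    image = ax.imshow(df[MONTHS].to_numpy(), aspect="auto", cmap="coolwarm")
    ax.set_xticks(range(len(MONTHS)))
    ax.set_xticklabels(MONTHS)
    # Label roughly every tenth row so the year ticks stay readable.
    step = max(1, len(df) // 10)
    ax.set_yticks(list(range(0, len(df), step)))
    ax.set_yticklabels(df.index[::step])
    ax.set_ylabel("Year")
    fig.colorbar(image, ax=ax, label="Temperature")
    return fig


def extreme_years(df, month):
    """Return (coldest year, warmest year) for one month column."""
    return df[month].idxmin(), df[month].idxmax()


heatmap_fig = plot_heatmap(sample)
coldest, warmest = extreme_years(sample, "May")
print(coldest, warmest)
```

This relies on the frame being indexed by year, as set up in Task 1. On the full frame the calls would be `plot_heatmap(data)` and `extreme_years(data, "May")`.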