# Assignment 4

Before working on this assignment please read these instructions fully. In the submission area, you will notice that you can click the link to **Preview the Grading** for each step of the assignment. These are the criteria that will be used for peer grading. Please familiarize yourself with the criteria before beginning the assignment.

This assignment requires that you find **at least** two related datasets on the web, and that you visualize these datasets to answer a question within the broad topic of **religious events or traditions** (see below) for the region of **Dexter, Michigan, United States**, or the **United States** more broadly.

You can merge these datasets with data from different regions if you like! For instance, you might want to compare **Dexter, Michigan, United States** to Ann Arbor, USA. In that case at least one source file must be about **Dexter, Michigan, United States**.
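
A minimal sketch of merging two related datasets by region, assuming pandas is available. The column names (`region`, `year`, `congregations`, `population`) and every value below are made-up placeholders for illustration, not real figures for Dexter or Ann Arbor.

```python
import pandas as pd

# Hypothetical dataset 1: congregation counts per region and year (placeholder values).
congregations = pd.DataFrame({
    "region": ["Dexter", "Dexter", "Ann Arbor", "Ann Arbor"],
    "year": [2010, 2020, 2010, 2020],
    "congregations": [2, 3, 10, 12],
})

# Hypothetical dataset 2: population per region and year (placeholder values).
population = pd.DataFrame({
    "region": ["Dexter", "Dexter", "Ann Arbor", "Ann Arbor"],
    "year": [2010, 2020, 2010, 2020],
    "population": [1000, 1000, 10000, 10000],
})

# Join on the shared keys; validate= raises if a key is unexpectedly duplicated.
merged = congregations.merge(
    population, on=["region", "year"], how="inner", validate="one_to_one"
)

# Normalize so regions of different sizes can be compared on one axis.
merged["congregations_per_10k"] = (
    merged["congregations"] / merged["population"] * 10_000
)
print(merged)
```

Normalizing by population is one possible design choice here; plotting raw counts for regions of very different sizes tends to hide the smaller one.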

You are welcome to choose datasets at your discretion, but keep in mind **they will be shared with your peers**, so choose appropriate datasets. Sensitive, confidential, illicit, and proprietary materials are not good choices for datasets for this assignment. You are welcome to upload datasets of your own as well, and link to them using a third-party repository such as GitHub, Bitbucket, Pastebin, etc. Please be aware of the Coursera terms of service with respect to intellectual property.

Also, you are welcome to preserve data in its original language, but for the purposes of grading you should provide English translations. You are welcome to provide multiple visuals in different languages if you would like!

As this assignment is for the whole course, you must incorporate principles discussed in the first week, such as having a high data-ink ratio (Tufte) and aligning with Cairo’s principles of truth, beauty, function, and insight.

Here are the assignment instructions:

 * State the region and the domain category that your data sets are about (e.g., **Dexter, Michigan, United States** and **religious events or traditions**).
 * You must state a question about the domain category and region that you identified as being interesting.
 * You must provide at least two links to available datasets. These could be links to files such as CSV or Excel files, or links to websites which might have data in tabular form, such as Wikipedia pages.
 * You must upload an image which addresses the research question you stated. In addition to addressing the question, this visual should follow Cairo's principles of truthfulness, functionality, beauty, and insightfulness.
 * You must contribute a short (1-2 paragraph) written justification of how your visualization addresses your stated research question.

What do we mean by **religious events or traditions**?  For this category you might consider calendar events, demographic data about religion in the region and neighboring regions, participation in religious events, or how religious events relate to political events, social movements, or historical events.

## Tips
* Wikipedia is an excellent source of data, and I strongly encourage you to explore it for new data sources.
* Many governments run open data initiatives at the city, region, and country levels, and these are wonderful resources for localized data sources.
* Several international agencies, such as the [United Nations](http://data.un.org/), the [World Bank](http://data.worldbank.org/), and the [Global Open Data Index](http://index.okfn.org/place/), are other great places to look for data.
* This assignment requires you to convert and clean datafiles. Check out the discussion forums for tips on how to do this from various sources, and share your successes with your fellow students!

## Example



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as st
import matplotlib.colors as col
import matplotlib.cm as cm
from matplotlib.pyplot import figure
import seaborn as sns
import re
%matplotlib notebook
plt.style.use('seaborn-colorblind')

# LOAD DATA 

In [2]:
df=pd.read_csv('Temperature in Ann Arbour.csv')
df.head()

Unnamed: 0,ID,Date,Element,Data_Value
0,USW00094889,11/12/2014,TMAX,22
1,USC00208972,4/29/2009,TMIN,56
2,USC00200032,5/26/2008,TMAX,278
3,USC00205563,11/11/2005,TMAX,139
4,USC00200230,2/27/2014,TMAX,-106


In [3]:
df.dtypes

ID            object
Date          object
Element       object
Data_Value     int64
dtype: object

In [4]:
df.columns

Index(['ID', 'Date', 'Element', 'Data_Value'], dtype='object')

# Data Visualization Before Cleaning

In [30]:
df.plot();


In [28]:
from matplotlib.pyplot import figure
figure(figsize=(10, 10), dpi=50)
sns.barplot(x = 'Element',y='Data_Value',data=df);
plt.xlabel('Elements',fontsize = 25, weight = 'bold',color='green');
plt.ylabel('Data_Value',fontsize = 25, weight = 'bold',color='red');
#plt.savefig('p2.png')
plt.show()


In [7]:
import statistics

In [8]:
mean= statistics.mean(df['Data_Value'])
mean

95

In [29]:
df.plot.hist(alpha=0.7);


# CLEAN DATA

In [10]:
# Text preprocessing steps - remove tokens containing digits, lowercase the text, and strip punctuation
import re
import string
alphanumeric = lambda x: re.sub(r'\w*\d\w*', ' ', x)
punc_lower = lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x.lower())
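
A short usage sketch of the two helpers above, applied to a single string. The wrapper name `clean_text` and the sample string are hypothetical and only illustrate what the regular expressions do.

```python
import re
import string

# The same two helpers, with the regex written as a raw-string pattern.
alphanumeric = lambda x: re.sub(r'\w*\d\w*', ' ', x)
punc_lower = lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x.lower())

def clean_text(text):
    """Drop digit-bearing tokens, lowercase, strip punctuation, collapse spaces."""
    return ' '.join(punc_lower(alphanumeric(text)).split())

print(clean_text('Station USW00094889, TMAX!'))  # station tmax
```

Both helpers expect strings, so mapping them over an integer column such as `Data_Value` would raise a `TypeError`. Note also that station IDs contain digits, so `alphanumeric` removes them entirely.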


In [11]:
#df['Element'] = df['Element'].str.title()

In [12]:
#df['Element'] = df['Element'].str.lower()

In [13]:
#df['ID'] = df.ID.map(punc_lower)
#df['Month'] = df.Month.map(punc_lower)
#df['Element'] = df.Element.map(punc_lower)
#df['Data_Value'] = df.Data_Value.map(alphanumeric).map(punc_lower)
#df['Day'] = df.Day.map(alphanumeric).map(punc_lower)

In [14]:
df.head()

Unnamed: 0,ID,Date,Element,Data_Value
0,USW00094889,11/12/2014,TMAX,22
1,USC00208972,4/29/2009,TMIN,56
2,USC00200032,5/26/2008,TMAX,278
3,USC00205563,11/11/2005,TMAX,139
4,USC00200230,2/27/2014,TMAX,-106


In [15]:
df.isnull()

Unnamed: 0,ID,Date,Element,Data_Value
0,False,False,False,False
1,False,False,False,False
2,False,False,False,False
3,False,False,False,False
4,False,False,False,False
5,False,False,False,False
6,False,False,False,False
7,False,False,False,False
8,False,False,False,False
9,False,False,False,False


In [16]:
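df.dropna(inplace=True)  # drop rows with any missing value
# df.fillna(0) would be a no-op here: no NaNs remain, and its result was never assigned
df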

Unnamed: 0,ID,Date,Element,Data_Value
0,USW00094889,11/12/2014,TMAX,22
1,USC00208972,4/29/2009,TMIN,56
2,USC00200032,5/26/2008,TMAX,278
3,USC00205563,11/11/2005,TMAX,139
4,USC00200230,2/27/2014,TMAX,-106
5,USW00014833,10/1/2010,TMAX,194
6,USC00207308,6/29/2010,TMIN,144
7,USC00203712,10/4/2005,TMAX,289
8,USW00004848,12/14/2007,TMIN,-16
9,USC00200220,4/21/2011,TMAX,72


In [17]:
df.isnull().sum()

ID            0
Date          0
Element       0
Data_Value    0
dtype: int64

In [18]:
df.duplicated()


0         False
1         False
2         False
3         False
4         False
5         False
6         False
7         False
8         False
9         False
10        False
11        False
12        False
13        False
14        False
15        False
16        False
17        False
18        False
19        False
20        False
21        False
22        False
23        False
24        False
25        False
26        False
27        False
28        False
29        False
          ...  
165055    False
165056    False
165057    False
165058    False
165059    False
165060    False
165061    False
165062    False
165063    False
165064    False
165065    False
165066    False
165067    False
165068    False
165069    False
165070    False
165071    False
165072    False
165073    False
165074    False
165075    False
165076    False
165077    False
165078    False
165079    False
165080    False
165081    False
165082    False
165083    False
165084    False
dtype: bool

In [19]:
# drop_duplicates() returns a new frame, so assign the result to keep it
df = df.drop_duplicates()
df

Unnamed: 0,ID,Date,Element,Data_Value
0,USW00094889,11/12/2014,TMAX,22
1,USC00208972,4/29/2009,TMIN,56
2,USC00200032,5/26/2008,TMAX,278
3,USC00205563,11/11/2005,TMAX,139
4,USC00200230,2/27/2014,TMAX,-106
5,USW00014833,10/1/2010,TMAX,194
6,USC00207308,6/29/2010,TMIN,144
7,USC00203712,10/4/2005,TMAX,289
8,USW00004848,12/14/2007,TMIN,-16
9,USC00200220,4/21/2011,TMAX,72
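
A minimal sketch of the duplicate check on a tiny illustrative sample in the same column layout (the repeated row is contrived for the example). It shows that `duplicated()` flags repeats and that `drop_duplicates()` leaves the original frame untouched unless the result is assigned.

```python
import pandas as pd

# Tiny illustrative sample; the second row deliberately repeats the first.
sample = pd.DataFrame({
    "ID": ["USW00094889", "USW00094889", "USC00208972"],
    "Date": ["11/12/2014", "11/12/2014", "4/29/2009"],
    "Element": ["TMAX", "TMAX", "TMIN"],
    "Data_Value": [22, 22, 56],
})

n_dupes = int(sample.duplicated().sum())  # count fully repeated rows
deduped = sample.drop_duplicates()        # returns a new frame; `sample` is unchanged

print(n_dupes, len(sample), len(deduped))
```

Summing the boolean mask is usually easier to read than scanning the full `duplicated()` output for a `True`.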


In [20]:
df.describe()


Unnamed: 0,Data_Value
count,165085.0
mean,95.422116
std,123.515131
min,-343.0
25%,0.0
50%,94.0
75%,189.0
max,406.0


# Data Manipulation and Visualization

In [21]:
# convert values from tenths of degrees Celsius to degrees Celsius
df['Data_Value'] = df['Data_Value']/10

# convert Date column to datetime format
df['Date'] = pd.to_datetime(df['Date'])

# add month and day column
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

# remove leap days
df = df[~((df['Date'].dt.month == 2) & (df['Date'].dt.day == 29))]

# create dataframe for 2015 data
df_15 = df[ df['Date'].dt.year == 2015 ]

# keep only data from the 2005-2014 time period
df = df[(df['Date'].dt.year >= 2005) & (df['Date'].dt.year <= 2014)]
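
The preparation steps in this cell can be sketched on a few made-up rows, which makes each filter easy to verify by eye. The frame name `raw` and its values are assumptions for illustration; the explicit `format=` argument is an optional addition that avoids day/month ambiguity when parsing.

```python
import pandas as pd

# Made-up rows in the notebook's layout; Data_Value is in tenths of a degree Celsius.
raw = pd.DataFrame({
    "Date": ["2/28/2012", "2/29/2012", "7/4/2015", "12/31/2004"],
    "Element": ["TMAX", "TMAX", "TMAX", "TMIN"],
    "Data_Value": [50, 61, 300, -120],
})

raw["Data_Value"] = raw["Data_Value"] / 10                    # tenths of °C -> °C
raw["Date"] = pd.to_datetime(raw["Date"], format="%m/%d/%Y")  # month/day/year

# Remove February 29 so every year has the same 365 calendar days.
is_leap_day = (raw["Date"].dt.month == 2) & (raw["Date"].dt.day == 29)
raw = raw[~is_leap_day]

year = raw["Date"].dt.year
baseline = raw[year.between(2005, 2014)]  # inclusive on both ends
recent = raw[year == 2015]

print(baseline)
print(recent)
```

Only the 2012-02-28 row survives in `baseline`: the leap day is dropped, 2004 falls before the window, and 2015 is set aside in `recent`.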

In [22]:
# extract max and min temperatures
df_max = df[ df['Element'] == 'TMAX' ].drop(['Element'], axis='columns')
df_min = df[ df['Element'] == 'TMIN' ].drop(['Element'], axis='columns')

# group by date to get record high and low temperatures for each day
df_max = df_max.groupby(by=['Month','Day']).aggregate({'Data_Value':'max'})
df_min = df_min.groupby(by=['Month','Day']).aggregate({'Data_Value':'min'})

df_max = df_max.reset_index()
df_min = df_min.reset_index()

x = list(range(len(df_max['Data_Value'])))
y_max = list(df_max['Data_Value'])
y_min = list(df_min['Data_Value'])
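
A small sketch of the per-calendar-day aggregation, using made-up observations so the expected record values are obvious. The names `obs`, `record_high`, and `record_low` are hypothetical.

```python
import pandas as pd

# Made-up observations: several readings for the same calendar days (°C).
obs = pd.DataFrame({
    "Month": [1, 1, 1, 1, 1, 1],
    "Day": [1, 1, 2, 2, 1, 2],
    "Element": ["TMAX", "TMAX", "TMAX", "TMAX", "TMIN", "TMIN"],
    "Data_Value": [1.5, 3.0, -2.0, 0.5, -9.0, -11.5],
})

by_day = ["Month", "Day"]

# Highest TMAX and lowest TMIN seen on each calendar day across all years.
record_high = (obs[obs["Element"] == "TMAX"]
               .groupby(by_day, as_index=False)["Data_Value"].max())
record_low = (obs[obs["Element"] == "TMIN"]
              .groupby(by_day, as_index=False)["Data_Value"].min())

print(record_high)
print(record_low)
```

Passing `as_index=False` keeps `Month` and `Day` as ordinary columns, which plays the same role as the `reset_index()` calls in the cell above.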

In [23]:
# create dataframe for 2015 - data
df_15_max = df_15[ df_15['Element'] == 'TMAX' ]
df_15_min = df_15[ df_15['Element'] == 'TMIN' ]

df_15_max = df_15_max.groupby(by='Date').aggregate({'Data_Value':'max'})
df_15_min = df_15_min.groupby(by='Date').aggregate({'Data_Value':'min'})

df_15_max = df_15_max.reset_index()
df_15_min = df_15_min.reset_index()

df_15_max = df_15_max[ df_15_max['Data_Value'] > df_max['Data_Value'] ]
df_15_min = df_15_min[ df_15_min['Data_Value'] < df_min['Data_Value'] ]
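
The element-wise comparison in this cell assumes that the 2015 frame and the record frame each hold one row per calendar day, in the same order. As a hedged alternative (not the method used in this notebook), the two frames can be joined on `Month` and `Day` so the comparison still holds when a day is missing. All names and values below are made up for illustration.

```python
import pandas as pd

# Made-up 10-year record highs per calendar day (°C).
record_high = pd.DataFrame({"Month": [1, 1, 1], "Day": [1, 2, 3],
                            "Data_Value": [10.0, 12.0, 11.0]})

# Made-up 2015 daily highs; January 2 is deliberately missing.
highs_2015 = pd.DataFrame({"Month": [1, 1], "Day": [1, 3],
                           "Data_Value": [11.5, 9.0]})

# Join on the calendar day so each 2015 value is compared with its own record.
joined = highs_2015.merge(record_high, on=["Month", "Day"],
                          suffixes=("_2015", "_record"))
broke_high = joined[joined["Data_Value_2015"] > joined["Data_Value_record"]]

print(broke_high)
```

With unequal row counts, a direct `Series > Series` comparison raises a `ValueError` about identically-labeled Series, so the join is the safer choice when coverage is incomplete.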


In [27]:
import matplotlib.pyplot as plt
import seaborn as sns

# create plot
plt.figure()

plt.plot(x, y_max, color='indianred', label='Highest Temperatures (2005-2014)')
plt.plot(x, y_min, color='skyblue', label='Lowest Temperatures (2005-2014)')
plt.fill_between(x, y_min, y_max, color='lightgray', alpha=0.6)

# overlay scatterplot from 2015 data
plt.scatter(df_15_max.index, df_15_max['Data_Value'], color='black', label=None)
plt.scatter(df_15_min.index, df_15_min['Data_Value'], color='black', label='2015 temperatures which broke 10-year record high/low')

plt.ylim(bottom=-38)  # `ymin` was removed in newer matplotlib releases; `bottom` works in old and new

xticks = [1, 32, 60, 91, 121, 152, 182, 213, 244, 274, 305, 335]
xticks_labels = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sept','Oct','Nov','Dec']
plt.xticks(xticks, xticks_labels)

plt.title('Highest and lowest temperatures in Ann Arbor, Michigan, US from 2005 to 2015')
plt.ylabel('Temperature [°C]')
plt.legend(loc=[0.25,0.02])

fig = plt.gcf()
fig.set_size_inches(12,7)

ax1 = plt.gca() # primary axes
ax2 = ax1.twinx() # secondary axes

ax1.grid(False)
ax2.grid(False) # turns off the grid lines

# make frame invisible 
for spine in ax2.spines:
    ax2.spines[spine].set_visible(False)
for spine in ax1.spines:
    ax1.spines[spine].set_visible(False)
    
ax1.spines['bottom'].set_visible(True)
ax1.spines['bottom'].set_alpha(0.3)

# remove ticks
ax1.tick_params(axis=u'both', which=u'both',length=0)
ax2.tick_params(axis=u'both', which=u'both',length=0)
plt.tick_params(right= False, labelright = False)

ax1.grid(True, alpha = 0.1)

# save before show(): with some backends the figure is empty once show() returns
plt.savefig('TemertureNuhaResult.png')
plt.show()


In [26]:
# `size` was renamed to `height` in seaborn 0.9; use `size=2` on older versions
sns.pairplot(df, hue='Month', diag_kind='kde', height=2);

