## MEJ Python Data Analysis & Visualization Training

Summer 2020

At this point, you should already have Python and Jupyter Notebook installed through Anaconda, so nothing new should need to be downloaded or installed ahead of time. If at any point you receive an error that a library isn't installed, simply type "pip install package" in the terminal, replacing "package" with the name of the missing library.

During this training, we’ll learn how to bring in, explore, visualize, and analyze our data using some common commands.


### Step 1: Import packages

In [None]:
import numpy as np # numpy is used for scientific computing and working with arrays
import matplotlib.pyplot as plt # matplotlib is used for plotting
import pandas as pd # pandas is used for data analysis and makes working with data tables easier
import statsmodels.formula.api as smf # statsmodels is an easy-to-use package for regression that works much like R


### Step 2: Read in the data

Note that the data file you read in needs to be in the same directory (folder) as your notebook.

In [None]:
df = pd.read_csv("EJSCREEN_2019.csv")
df.head()

### Step 3: Explore the data

First, try out .shape to see the size of your dataframe. .shape is an attribute rather than a function, so it takes no parentheses, and pandas dataframes share it with numpy arrays. For a dataframe or matrix, .shape returns the number of rows first, followed by the number of columns.

In [None]:
# Your code here

Alternatively, if you just want to see how many rows your dataframe has, you can use the len() function. len() is a built-in Python function that works on many types of objects. For instance, len("apple") will return 5 and len(df) will return the number of rows in your dataframe.

In [None]:
# Your code here
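
As a minimal sketch of how .shape and len() differ, the toy dataframe below has 3 rows and 2 columns. The name `toy` and its columns are made up for illustration and are not part of the EJSCREEN data.

```python
import pandas as pd

# Hypothetical toy data, not the EJSCREEN file
toy = pd.DataFrame({"tract": [1, 2, 3], "population": [100, 250, 75]})

# .shape is an attribute, so there are no parentheses after it
n_rows, n_cols = toy.shape
print(toy.shape)  # (rows, columns)

# len() only reports the number of rows
print(len(toy))
```

Unpacking .shape into two variables is handy when you need the row and column counts separately later on.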

To see a list of the column names, use the built-in Python list() function with the pandas .columns attribute, which returns the column labels.

In [None]:
# Your code here
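
The sketch below shows the difference between .columns on its own and wrapped in list(), again using a made-up toy dataframe rather than the EJSCREEN data.

```python
import pandas as pd

# Hypothetical toy data, not the EJSCREEN file
toy = pd.DataFrame({"tract": [1, 2, 3], "population": [100, 250, 75]})

# .columns returns a pandas Index object holding the labels
print(toy.columns)

# list() turns that Index into a plain Python list of column names
col_names = list(toy.columns)
print(col_names)
```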

Now let's see how many null values we have by column. The result has one row per column, and pandas truncates long output by default, so since we have 78 columns we need to change the pandas display options to show all rows (you can do the same for max columns).

In [None]:
pd.set_option('display.max_rows', None)
df.isnull().sum()

For this dataset, null values are input as "None" - this is why Python warned us that some columns have mixed datatypes and why the null value counts sum to 0. If we are going to use these columns, we need to change all "None" values to null values.
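
To see why the counts come out as 0, here is a small sketch with a made-up column that mixes numbers with the text "None". The column names are borrowed from this notebook, but the values are invented for illustration.

```python
import pandas as pd

# Hypothetical values; "None" here is an ordinary string, not a missing value
toy = pd.DataFrame({
    "PM25": [8.1, "None", 7.4],
    "ACSTOTPOP": [1200, 950, 1800],
})

# isnull() only flags true missing values, so the string "None" is not counted
null_counts = toy.isnull().sum()
print(null_counts)

# Mixing numbers and text forces the column to the generic object datatype
print(toy.dtypes)
```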

First let's confirm which columns have mixed datatypes - they will have an object datatype.

In [None]:
# Use .dtypes 
# Your code here
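
One possible way to pick out only the object columns, sketched on made-up data, is to filter the result of .dtypes. The variable name `object_cols` is just an illustration.

```python
import pandas as pd

# Hypothetical values, not the EJSCREEN file
toy = pd.DataFrame({
    "PM25": [8.1, "None", 7.4],
    "ACSTOTPOP": [1200, 950, 1800],
})

# .dtypes returns one datatype per column
print(toy.dtypes)

# Keep only the columns whose datatype is object
object_cols = list(toy.dtypes[toy.dtypes == object].index)
print(object_cols)
```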

Now we'll use .replace(old value, new value) to replace None values with np.nan - null values. Remember to save your results by saving the output to df and check your results by seeing how many null values there are now.

In [None]:
# Your code here
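
The pattern looks roughly like this on a made-up dataframe. Note that .replace() returns a new dataframe, which is why the result is assigned back to the same name.

```python
import numpy as np
import pandas as pd

# Hypothetical values, not the EJSCREEN file
toy = pd.DataFrame({
    "PM25": [8.1, "None", 7.4],
    "ACSTOTPOP": [1200, 950, 1800],
})

# Swap the text "None" for a true missing value and save the result
toy = toy.replace("None", np.nan)

# The missing value is now counted
null_counts = toy.isnull().sum()
print(null_counts)
```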

Finally we need to change columns with object data types to float data types. Check your results by using dtypes again.

In [None]:
df[['PM25', 'OZONE', 'DSLPM', 'CANCER', 'RESP', 'PWDIS']] = df[['PM25', 'OZONE', 'DSLPM', 'CANCER', 'RESP', 'PWDIS']].astype(float)
# ID, the census tract number, should also be an integer rather than a float so we'll change that here too
df['ID'] = df['ID'].astype(int) 
df.dtypes
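
A small sketch of the same conversions on made-up data follows. It also illustrates one caveat: a column that still contains null values can be converted to float but not to int, so the integer conversion only works on columns without missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical values; PM25 starts out as an object column with a null value
toy = pd.DataFrame({
    "PM25": pd.Series([8.1, np.nan, 7.4], dtype=object),
    "ID": [8001.0, 8002.0, 8003.0],
})

toy["PM25"] = toy["PM25"].astype(float)  # object -> float, nulls are kept
toy["ID"] = toy["ID"].astype(int)        # float -> int, works because ID has no nulls
print(toy.dtypes)

# Converting a column with a null value to int raises an error
try:
    pd.Series([1.0, np.nan]).astype(int)
    int_cast_failed = False
except ValueError:
    int_cast_failed = True
print(int_cast_failed)
```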

Since we're only interested in Colorado for now, let's create a subset of our data. 

In [None]:
co = df[df.STATE_NAME == 'Colorado']
print(len(co))
co.head()
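
Under the hood, the comparison inside the brackets builds a True/False value for every row, and only the True rows are kept. A sketch on made-up data (the names `mask` and `subset` are illustrative):

```python
import pandas as pd

# Hypothetical values, not the EJSCREEN file
toy = pd.DataFrame({
    "STATE_NAME": ["Colorado", "Utah", "Colorado"],
    "CANCER": [30.0, 25.0, 41.0],
})

# The comparison produces one True/False value per row
mask = toy["STATE_NAME"] == "Colorado"
print(mask)

# Indexing with the mask keeps only the True rows;
# .copy() is optional and makes the subset independent of the original
subset = toy[mask].copy()
print(len(subset))
```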

Now let's look at some basic statistics about our data using .describe()

In [None]:
# Your code here
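
As a rough sketch of what .describe() returns, the made-up column below produces a small table of summary statistics that can be indexed by statistic name and column name.

```python
import pandas as pd

# Hypothetical values, not the EJSCREEN file
toy = pd.DataFrame({"CANCER": [20.0, 30.0, 40.0, 50.0]})

# describe() reports count, mean, std, min, quartiles, and max per numeric column
summary = toy.describe()
print(summary)

# Individual statistics can be looked up with .loc[row, column]
print(summary.loc["mean", "CANCER"])  # 35.0
```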

There seems to be a large variance in Cancer Risk. Let's look at the 5 census tracts with the largest Cancer Risk.

In [None]:
co.sort_values(by = ['CANCER'], ascending = False)[['ID', 'ACSTOTPOP', 'CANCER']].head(5)
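
The sketch below breaks that one-liner into its pieces on made-up data and shows pandas' .nlargest() as one possible shortcut for the same result.

```python
import pandas as pd

# Hypothetical values, not the EJSCREEN file
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "CANCER": [30.0, 55.0, 41.0, 20.0],
})

# Sort from largest to smallest, keep two columns, then take the first 2 rows
top2 = toy.sort_values(by="CANCER", ascending=False)[["ID", "CANCER"]].head(2)
print(top2)

# nlargest() does the sorting and the row selection in one step
top2_alt = toy.nlargest(2, "CANCER")[["ID", "CANCER"]]
print(top2_alt)
```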

### Step 4: Visualize the data

Now that we're familiar with our data, let's start visualizing it. First let's look at the distribution of Cancer Risk.

In [None]:
# pass range() to bins to define bins of width 20 that start at 0 and end at 600
# (range() excludes its stop value, so the stop is 601 to include the edge at 600)
plt.hist(co["CANCER"], bins = range(0, 601, 20))
plt.xlabel("Total Cancer Risk (per million)") # add an x label
plt.ylabel("Number of Census Tracts") # add a y label
plt.title("Cancer Risk Distribution") # add a title
plt.show() # show the figure
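
One detail worth checking when building bins with range() is that range() excludes its stop value, so range(0, 600, 20) produces a last edge of 580, while a stop just above 600 is needed for the last bin to end at 600. The sketch below uses numpy's histogram function, whose binning rules plt.hist follows, on three made-up values.

```python
import numpy as np

edges_600 = list(range(0, 600, 20))
edges_601 = list(range(0, 601, 20))
print(edges_600[-1])  # 580
print(edges_601[-1])  # 600

# Hypothetical values; 590 lies beyond the last edge of edges_600
values = np.array([5, 25, 590])

counts_600, _ = np.histogram(values, bins=edges_600)
counts_601, _ = np.histogram(values, bins=edges_601)
print(counts_600.sum())  # 2, the value 590 is silently dropped
print(counts_601.sum())  # 3
```

Values outside the outermost edges are left out of the histogram without any warning, so it is worth comparing the bin range against the minimum and maximum of the column being plotted.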

While plt.YourPlotStyle is the most straightforward way to create a chart using matplotlib, you can also define the figure and its subplots first. This gives you more options for plotting. Notice how some of the commands change slightly, such as those for setting an x label and title. In a notebook, there's also no need for plt.show().

In [None]:
fig, ax = plt.subplots(ncols = 1) # here you're defining 1 subplot, which is called ax
ax.scatter(co["ACSTOTPOP"], co["CANCER"])
ax.set_xlabel("Population") # add an x label
ax.set_ylabel("Total Cancer Risk (per million)") # add a y label
ax.spines['right'].set_visible(False) # you can get rid of unwanted border lines (spines) with this code
ax.spines['top'].set_visible(False)
_ = ax.set_title("Population versus Cancer Risk") # _ prevents this command from printing out the title
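
One of the extra options that plt.subplots() opens up is placing several charts side by side. The sketch below assumes made-up data and asks for two subplots, in which case the second return value is an array of axes rather than a single one.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values, not the EJSCREEN file
toy = pd.DataFrame({
    "ACSTOTPOP": [1200, 950, 1800, 2400, 600],
    "CANCER": [30.0, 25.0, 41.0, 38.0, 22.0],
})

# ncols=2 returns an array holding two axes, one per subplot
fig, axes = plt.subplots(ncols=2, figsize=(8, 3))

axes[0].hist(toy["CANCER"], bins=5)
axes[0].set_title("Histogram")

axes[1].scatter(toy["ACSTOTPOP"], toy["CANCER"])
axes[1].set_title("Scatter")

fig.tight_layout()  # keep labels and titles from overlapping
```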

Explore some other relationships with your own chart below.

In [None]:
# Your code goes here

### Step 5: Analyze the data

Now let's explore these relationships further through some analysis.

What do you observe in your scatter plots? Does there appear to be a relationship between your variables? Calculate the correlation between these two variables. Do you think this relationship is causal, or just a correlation?

In [None]:
# Use df[['v1', 'v2']].corr()
# Your code here
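
To get a feel for what .corr() returns, the sketch below uses made-up columns where the answer is known in advance: `y` is exactly twice `x`, and `z` falls as `x` rises.

```python
import pandas as pd

# Hypothetical values chosen so the correlations are easy to predict
toy = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],
    "z": [4, 3, 2, 1],
})

# Selecting two columns gives a 2 x 2 correlation table
pair = toy[["x", "y"]].corr()
print(pair)

# A single value can be read from the table or computed directly
r_xy = pair.loc["x", "y"]       # perfectly linear and increasing
r_xz = toy["x"].corr(toy["z"])  # perfectly linear and decreasing
print(r_xy, r_xz)
```

Correlations range from -1 to 1, and real data will almost never reach either extreme.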

Next let's explore correlations between all our variables. Are there any surprising relationships?

In [None]:
# Your code here
# You can use style.background_gradient(cmap = 'RdYlGn') to color your matrix
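
A sketch of a full correlation matrix on made-up data follows. Selecting the numeric columns first is one way to keep text columns such as state names out of the calculation.

```python
import pandas as pd

# Hypothetical values, not the EJSCREEN file
toy = pd.DataFrame({
    "STATE_NAME": ["Colorado", "Colorado", "Colorado", "Colorado"],
    "PM25": [6.1, 7.4, 8.0, 5.5],
    "OZONE": [48.0, 51.0, 50.0, 47.0],
    "CANCER": [25.0, 31.0, 36.0, 22.0],
})

# Keep only numeric columns, then correlate every pair of them
numeric = toy.select_dtypes(include="number")
corr_matrix = numeric.corr()
print(corr_matrix)
```

The diagonal is always 1 because each variable is perfectly correlated with itself, so the interesting values are off the diagonal.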

Let's run a regression on a relationship of interest.

In [None]:
# Use model = smf.ols(formula = 'y ~ x', data = df).fit() and print(model.summary())
# Your code here
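
Besides the printed summary, the fitted model exposes its key numbers as attributes. The sketch below fits a regression on synthetic data (generated with a true slope of 2, purely for illustration) and pulls out the coefficient, its p-value, and the R-squared value.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data, not the EJSCREEN file: y is roughly 2 * x plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({"x": x, "y": 2.0 * x + rng.normal(scale=0.5, size=200)})

model = smf.ols(formula="y ~ x", data=toy).fit()

slope = model.params["x"]     # estimated change in y for a one-unit change in x
p_value = model.pvalues["x"]  # small values suggest the slope differs from 0
r_squared = model.rsquared    # share of the variation in y explained by x
print(slope, p_value, r_squared)
```

These are the same numbers that appear in the coef, P>|t|, and R-squared entries of model.summary().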

Describe your results.

In [None]:
# Your explanation here

A one-unit increase in PM 2.5 is associated with a 3.24-point increase in the Cancer Risk Indicator. However, our p-value is very high, so this estimate is not statistically significant, and the relatively low R-squared value means PM 2.5 doesn't explain much of the variation in Cancer Risk.