In [127]:
%matplotlib notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Reading the Data

I saved the PDF as a Word document, copied and pasted the data into a spreadsheet, and saved it as a `.csv` file. Next, we want to make sure that everything was read in as the correct data type.

In [128]:
#read the data and name it gf
gf = pd.read_csv('data/paige_data.csv')

In [129]:
#examine the first five rows of the data
gf.head()

Unnamed: 0,State,Total,American Indian/ Alaska,Asian/Pacific Islander,Hispanic,Black,White,Economically disadvantaged,Limitd English proficiency,Studentswith disabilities
0,Alaska,70.0,54.0,76.0,70.0,61.0,76.0,59.0,47.0,46.0
1,Arizona,76.0,63.0,84.0,70.0,71.0,84.0,71.0,24.0,65.0
2,Arkansas,84.0,78.0,84.0,78.0,78.0,87.0,79.0,77.0,79.0
3,California,78.0,72.0,90.0,73.0,66.0,86.0,73.0,62.0,61.0
4,Colorado,75.0,58.0,82.0,62.0,66.0,82.0,61.0,53.0,54.0


In [130]:
#examine the data type for the variables
#note that object is not a numeric data type (it usually means the values were read as strings)
gf.dtypes

State                         object
Total                         object
American Indian/ Alaska       object
Asian/Pacific Islander        object
Hispanic                      object
Black                         object
White                         object
Economically disadvantaged    object
Limitd English proficiency    object
Studentswith disabilities     object
dtype: object

### Cleaning the Data

Here, I want every column except the first to be stored as numbers, so I'll convert them all to floating-point (decimal) numbers.

In [131]:
#make a list of column names I want to change
names = gf.columns[1:]

In [132]:
#convert values to numbers in each column except State; entries that cannot be parsed become NaN
for i in names:
    gf[i] = pd.to_numeric(gf[i], errors='coerce')

In [133]:
gf.dtypes

State                          object
Total                         float64
American Indian/ Alaska       float64
Asian/Pacific Islander        float64
Hispanic                      float64
Black                         float64
White                         float64
Economically disadvantaged    float64
Limitd English proficiency    float64
Studentswith disabilities     float64
dtype: object
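
To see what `errors='coerce'` does, here is a minimal sketch on made-up values (the strings below are illustrative placeholders, not entries taken from the data file). Anything that cannot be parsed as a number becomes `NaN` instead of raising an error.

```python
import pandas as pd

# Made-up values for illustration; "n/a" and "<5" are hypothetical placeholders
raw = pd.Series(["70", "54.5", "n/a", "<5"])

# errors='coerce' turns unparseable entries into NaN instead of raising
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)         # dtype of the converted column
print(converted.isna().sum())  # number of entries that could not be parsed
```

Because `NaN` is a floating-point value, a column with coerced entries ends up as `float64` even if every valid entry was a whole number.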

### Descriptive Statistics

Here, we can get a summary of the numeric columns using the `DataFrame.describe()` method, called below as `gf.describe()`.

In [113]:
gf.describe()

Unnamed: 0,Total,American Indian/ Alaska,Asian/Pacific Islander,Hispanic,Black,White,Economically disadvantaged,Limitd English proficiency,Studentswith disabilities
count,47.0,45.0,47.0,47.0,47.0,47.0,47.0,47.0,47.0
mean,79.893617,68.844444,86.404255,71.234043,69.06383,84.914894,71.191489,60.0,60.276596
std,6.878555,10.674855,5.570339,7.667585,7.545238,5.408648,6.765123,12.341165,13.76306
min,59.0,45.0,74.0,53.0,48.0,71.0,58.0,23.0,24.0
25%,76.0,63.0,84.0,67.0,64.5,82.0,66.0,53.0,54.0
50%,81.0,70.0,87.0,73.0,71.0,86.0,72.0,62.0,64.0
75%,85.0,77.0,90.0,77.0,74.5,89.0,75.0,68.0,70.5
max,89.0,88.0,95.0,86.0,84.0,93.0,85.0,83.0,81.0
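
The `count` row reports only non-missing values, so a column with a smaller count (here `American Indian/ Alaska`, with 45 against 47 elsewhere) contains missing entries, presumably introduced when unparseable values were coerced to `NaN`. A quick sketch of counting them per column, using a small made-up frame rather than the real data:

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for gf; the values are illustrative only
demo = pd.DataFrame({
    "State": ["A", "B", "C"],
    "Total": [70.0, 76.0, 84.0],
    "Hispanic": [70.0, np.nan, 78.0],
})

# isna() marks missing cells; summing gives the number missing per column
missing_per_column = demo.isna().sum()
print(missing_per_column)

# describe() counts only the non-missing values in each numeric column
print(demo.describe().loc["count"])
```

Running the same `isna().sum()` call on `gf` would show which columns are responsible for the lower counts.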


### Plotting Data

Here are some plots made with the `matplotlib` plotting library directly, as well as plots made straight from the DataFrame, as shown in the cheatsheet.

In [175]:
#empty values will cause problems in plots, so I drop the rows that contain them here
gf = gf.dropna()
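
Note that `dropna()` with no arguments removes every row containing at least one missing value, so a state that is missing a single figure disappears from all later plots. If that is too aggressive, one option is to drop rows only where the columns being plotted are missing. A sketch on made-up data (the frame below is illustrative, not the real file):

```python
import numpy as np
import pandas as pd

# Made-up frame: row "B" is missing a value in the "Black" column only
demo = pd.DataFrame({
    "State": ["A", "B", "C"],
    "White": [76.0, 84.0, 87.0],
    "Black": [61.0, np.nan, 78.0],
})

# Default: drop any row that has a missing value in any column
all_complete = demo.dropna()

# Alternative: drop rows only when the listed columns are missing
white_only = demo.dropna(subset=["White"])

print(len(all_complete), len(white_only))
```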

In [176]:
plt.figure()
plt.scatter(gf['White'], gf['Economically disadvantaged'])

<matplotlib.collections.PathCollection at 0x140446278>

In [149]:
plt.figure()
plt.hist(gf['Economically disadvantaged'])

(array([  3.,   3.,   7.,   4.,   4.,  10.,   6.,   2.,   4.,   2.]),
 array([ 58. ,  60.7,  63.4,  66.1,  68.8,  71.5,  74.2,  76.9,  79.6,
         82.3,  85. ]),
 <a list of 10 Patch objects>)

In [177]:
gf.plot.box(fontsize = 6, rot = 70)

<matplotlib.axes._subplots.AxesSubplot at 0x14042a390>
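
The plots above have no axis labels or titles. Here is a minimal sketch of adding them through `matplotlib`'s `Axes` methods. The small frame holds the values from the first four rows shown by `gf.head()`, and the title text is my own wording.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Values copied from the first four rows displayed by gf.head()
demo = pd.DataFrame({
    "White": [76.0, 84.0, 87.0, 86.0],
    "Economically disadvantaged": [59.0, 71.0, 79.0, 73.0],
})

# Create a figure and axes, then draw the scatter plot on the axes
fig, ax = plt.subplots()
ax.scatter(demo["White"], demo["Economically disadvantaged"])

# Label the axes with the column names and add a title
ax.set_xlabel("White")
ax.set_ylabel("Economically disadvantaged")
ax.set_title("Economically disadvantaged vs. White")
```

The same `set_xlabel`, `set_ylabel`, and `set_title` calls work on the axes object returned by `gf.plot.box(...)`.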

In [165]:
#show the help for gf.plot to see additional customization options
gf.plot?