# Introduction to Python for Data Exploration and Statistics


Thus far we have been learning Python basics. These are great for computer scientists, but of course, what we as social scientists and humanists want to do is analyze data (which we learned last week is objectified information!). This week we'll analyze real data using our first Python module (also called a library or package): Pandas. A module is simply a bundle of functions other people wrote that you can then re-use. You have to install a module before you can use it. If you installed the Anaconda distribution of Python, Pandas, along with most of the modules we will use, comes pre-installed. If you did not install the Anaconda distribution, you will have to install Pandas yourself (using `pip` or some other method).
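
If you are not sure whether a module is installed, here is a minimal sketch of one way to check without importing it. It uses `importlib.util.find_spec` from Python's standard library; the helper name `is_installed` is made up for this example.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can find a module with this name."""
    return importlib.util.find_spec(module_name) is not None

# check the two modules we use in this notebook
for name in ["pandas", "matplotlib"]:
    if is_installed(name):
        print(name, "-> installed")
    else:
        print(name, "-> missing (install it, e.g. with pip)")
```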

Like Python, each module has its own syntax that you have to learn how to write correctly. We'll start to learn the Pandas syntax today.

<i>Pandas</i> is a popular and flexible package whose primary use is its datatype: the <i>DataFrame</i>. The dataframe is essentially a spreadsheet, like you would find in Excel, but it has some tricks up its sleeve!

As we will see, Pandas allows us to do basic statistics easily, allows us to compare columns, and allows us to do quick and easy visualizations. 

We will keep practicing these uses of Pandas throughout the semester. Today, I'm just planting the seed.


# Reminder: our growing Python toolkit

It's always helpful to keep in mind all the tools we have learned. We will continue to use these throughout the semester. I'll list some important ones here, just to keep reinforcing what we've learned.

* values (e.g. `1.2`, `100`, `'Hello, Boston!'`)
* variables and their types (e.g., `float`, `int`, `str`)
* operators (e.g., `=`, `+`, `-`)
* comparison operators (e.g., `==`, `>`, `<`, `>=`)
* statements and expressions (e.g. `10 + 500`)
* built-in functions (e.g. `print()`, `type()`)
* string functions and string methods (e.g., `string.lower()`, `string.islower()`)
* list functions and list methods (e.g., `len(mylist)`, `mylist.append()`)
* conditionals (e.g., `if`, `else`, `elif`)
* loops (e.g., `for` loops)
* user-defined functions (using `def`)

# Relative File Structures

We will need to read in a file from our hard drive (our secondary memory) into our primary memory. We're going to use Pandas functions to do this today, as we'll be working with dataframes, but later we'll read in plain text files.

We will use the *relative* file structure to do so, in order to make our code reproducible. You should put this script in your `scripts` folder, and you should put the .csv in your `data` folder. If you don't do this correctly, you'll get an error below that we'll troubleshoot.
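
If you do hit that error, a quick check like the sketch below can help. It assumes the folder layout described above and uses `pathlib` from the standard library; the variable name `relative_path` is just for illustration.

```python
from pathlib import Path

# the relative path we will hand to pandas later in this notebook
relative_path = Path("../data/education_dataset.csv")

# where Python is currently "standing" (normally the folder this script is in)
print("Current working directory:", Path.cwd())

# the full location the relative path points to from there
print("The relative path points to:", relative_path.resolve())

# False means the file is not where the relative path says it should be
print("Does the file exist there?", relative_path.exists())
```

If the last line prints `False`, compare the two printed locations with where you actually saved the .csv.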

I found that [this blog](https://desktop.arcgis.com/en/arcmap/10.3/tools/supplement/pathnames-explained-absolute-relative-unc-and-url.htm) offers a good explanation of files and file structures. To quote from that blog:

> Relative path
> A relative path refers to a location that is relative to a current directory. Relative paths make use of two special symbols, a dot (.) and a double-dot (..), which translate into the current directory and the parent directory. Double dots are used for moving up in the hierarchy. A single dot represents the current directory itself.
> 
> In the example directory structure below, assume you used Windows Explorer to navigate to D:\Data\Shapefiles\Soils. After navigating to this directory, a relative path will use D:\Data\Shapefiles\Soils as the current directory (until you navigate to a new directory, at which point the new directory becomes the current directory). The current directory is sometimes referred to as the root directory.
> 
> If you wanted to navigate to the Landuse directory from the current directory (Soils), you could type the following in the Windows Explorer Address box:
> 
> ..\\Landuse
> 
> Windows Explorer would navigate to D:\Data\Shapefiles\Landuse. A few more examples using D:\Data\Shapefiles\Landuse as the current directory are below:
> 
> ..               (D:\Data\Shapefiles)  
> ..\\..            (D:\Data)  
> ..\\..\Final      (D:\Data\Final)  
> .                (D:\Data\Shapefiles\Landuse - the current directory)  
> .\\..\Soils       (D:\Data\Shapefiles\Soils)  
> ..\\..\\.\Final\\..\Shapefiles\.\Landuse  (D:\Data\Shapefiles\Landuse)
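
To see how the dots and double dots are resolved, here is a small sketch that works through the quoted examples in Python. It uses `ntpath`, the standard-library module for Windows-style paths (it runs on any operating system); the variable names are made up for this example.

```python
import ntpath  # Windows-style path handling; works on any operating system

current_directory = r"D:\Data\Shapefiles\Landuse"

relative_paths = [
    r"..",
    r"..\..",
    r"..\..\Final",
    r".",
    r".\..\Soils",
    r"..\..\.\Final\..\Shapefiles\.\Landuse",
]

for relative in relative_paths:
    # join the two pieces, then collapse every "." and ".."
    resolved = ntpath.normpath(ntpath.join(current_directory, relative))
    print(relative, "->", resolved)
```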



# The Pandas Dataframe

We're going to jump into Pandas using real data.

******************************
The data we'll analyze today comes from:

National Center for Education Statistics, United States Department of Education. (2009). Early Childhood Longitudinal Study, Kindergarten Class of 1998-99 (ECLS-K) [Data file]. Available from http://nces.ed.gov/ecls/kindergarten.asp

I selected five variables (columns) to analyze:

* reading_score = READING IRT SCALE SCORE
* math_score = MATH IRT SCALE SCORE
* knowledge_score = GENERAL KNOWLEDGE IRT SCALE SCORE
* p2income = TOTAL HOUSEHOLD INCOME
* incomecat = INCOME CATEGORIES
    * 1 = low income: < \$40,000
    * 2 = mid income
    * 3 = high income: >= \$70,000
    
The unit of observation (row) is the individual kindergartner. The file is a comma-separated file, with utf-8 encoding.
   
## Motivating Question

**Are math, reading, and general knowledge scores related to household income in any predictable way?**


In [None]:
#import our library
#this is the simplest way to import a module
# if you get a `module not found` error it means you have not installed this particular module
import pandas

Like Python more generally, Pandas relies on functions. These are not built-in Python functions, but Pandas-specific ones. We will use the `read_csv()` function; note the `pandas` prefix: `pandas.read_csv()`.

In [None]:
#Note the relative file structure
#We'll go up one directory, and then down into the `data/` directory:

df = pandas.read_csv("../data/education_dataset.csv", sep=',', encoding='utf8')

#always check your data type!
type(df)
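
If you want to see what `read_csv()` does without touching a file, here is a minimal sketch that reads a few lines of CSV text held in memory. The column names match our dataset, but the numbers are invented for illustration; `io.StringIO` simply lets a string stand in for a file.

```python
import io
import pandas

# Made-up values, NOT real ECLS-K data; the first line holds the column names
csv_text = """reading_score,math_score,incomecat
30.5,25.0,1
42.0,38.5,2
51.5,47.0,3
"""

# StringIO lets read_csv treat the string as if it were a file on disk
small_df = pandas.read_csv(io.StringIO(csv_text), sep=',')

print(type(small_df))   # a pandas DataFrame
print(small_df.shape)   # (number of rows, number of columns)
print(small_df.dtypes)  # the data type pandas inferred for each column
```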

In [None]:
#It's a pandas DataFrame - we work with it using pandas functions and methods
#Let's take a look using the head() method

df.head()

In [None]:
#or view the entire dataframe
df

## Dataframe slicing 

Dataframe slicing works much like list and string slicing, but with a few quirks of its own.

In [None]:
#syntax to extract columns
df['reading_score'].head()

In [None]:
#reminder, you can look at the entire thing:
df['reading_score']

In [None]:
#extract one row: notice the syntax
df.iloc[0]
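
Here is a small sketch that puts the slicing patterns side by side. The dataframe `toy_df` and its values are made up for illustration; only the column names come from our dataset.

```python
import pandas

# Made-up values for illustration only
toy_df = pandas.DataFrame({
    'reading_score': [30.5, 42.0, 51.5, 38.0],
    'math_score': [25.0, 38.5, 47.0, 33.5],
})

one_column = toy_df['reading_score']     # a single column (a pandas Series)
first_row = toy_df.iloc[0]               # a single row, chosen by position
first_two_rows = toy_df.iloc[0:2]        # rows 0 and 1; the stop position is excluded, as with lists
one_cell = toy_df['math_score'].iloc[2]  # column first, then position within it

print(first_two_rows)
print(one_cell)
```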

## Summary Statistics

OK, we know what we want to do with quantitative dataframes: summarize them.

In [None]:
## Summary statistics
df['reading_score'].mean()

In [None]:
df['reading_score'].sum()

In [None]:
df['reading_score'].std()

In [None]:
# Or, we can find it all at the same time (for quantitative columns)

df.describe()
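
To make these numbers less of a black box, here is a sketch that computes the mean and standard deviation by hand for three made-up values and compares them with the Pandas methods. Note that `.std()` in Pandas divides by n - 1 (the sample standard deviation) by default.

```python
import pandas

scores = pandas.Series([10.0, 20.0, 30.0])  # made-up values

n = len(scores)
total = scores.sum()   # 10 + 20 + 30 = 60
mean = total / n       # 60 / 3 = 20

# sample standard deviation: squared distances from the mean, divided by n - 1
squared_distances = (scores - mean) ** 2                   # 100, 0, 100
std_by_hand = (squared_distances.sum() / (n - 1)) ** 0.5   # square root of 200 / 2

print(mean, scores.mean())
print(std_by_hand, scores.std())
```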

## Differences between means 

What if we want to know if the mean is different across categories? This is one of the more common uses of data analysis. For example, we might want to know if the average wage is different for men and women. We don't have gender in our data, but we do have income category (see the description above). We'll use that to compare means across our three scores. Remember our motivating question:

**Are math, reading, and general knowledge scores related to household income in any predictable way?**


First step: are the averages different for each of our income categories?

To find out, we can use one of two methods.

First, the "manual" method. We'll use boolean statements to create three separate dataframes.

In [None]:
#Use of boolean operators with dataframes:

df['incomecat']==1

In [None]:
#slice out the rows where the condition is true, and look at the first few
df[df['incomecat']==1].head()

In [None]:
#save it as a new variable, including our other income categories:
df_incomecat1 = df[df['incomecat']==1]
df_incomecat2 = df[df['incomecat']==2]
df_incomecat3 = df[df['incomecat']==3]
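
Here is a small sketch of what is happening in the cells above, using made-up values. The comparison produces one True/False value per row (often called a mask), and putting that mask inside the square brackets keeps only the rows marked True. The names `toy_df`, `mask`, and `low_income` are just for illustration.

```python
import pandas

# Made-up values for illustration only
toy_df = pandas.DataFrame({
    'reading_score': [30.0, 40.0, 50.0, 60.0],
    'incomecat': [1, 2, 1, 3],
})

mask = toy_df['incomecat'] == 1   # one True/False value per row
print(mask.tolist())
print(mask.sum())                 # True counts as 1, so this counts the matching rows

low_income = toy_df[mask]         # keeps only the rows where the mask is True
print(low_income)
```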

In [None]:
#now, print out our means

print(df_incomecat1['reading_score'].mean())
print(df_incomecat2['reading_score'].mean())
print(df_incomecat3['reading_score'].mean())

In [None]:
#Pandas has a way to do this without creating new dataframes! The Pandas groupby function
#create a new grouped object that splits the rows by income category

df_grouped = df.groupby('incomecat')
df_grouped

In [None]:
df_grouped['reading_score'].mean()
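
To convince ourselves that the two methods agree, here is a sketch that runs both on a tiny made-up dataframe. The values and the names `toy_df`, `manual_means`, and `grouped_means` are invented for this example.

```python
import pandas

# Made-up values for illustration only
toy_df = pandas.DataFrame({
    'reading_score': [30.0, 40.0, 50.0, 60.0, 70.0, 80.0],
    'incomecat': [1, 1, 2, 2, 3, 3],
})

# "manual" method: one boolean slice per category
manual_means = {}
for category in [1, 2, 3]:
    rows_in_category = toy_df[toy_df['incomecat'] == category]
    manual_means[category] = rows_in_category['reading_score'].mean()

# groupby method: split the rows by category, then average each group
grouped_means = toy_df.groupby('incomecat')['reading_score'].mean()

print(manual_means)
print(grouped_means)
```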

# Visualization

We can also use visualizations to explore our data. We'll just touch on this today. We'll learn how to make pretty visualizations later.

We'll use another library for this: `matplotlib`, the most popular Python visualization library.

In [None]:
#note the different syntax here, we're going to rename the library to something shorter when we import it
import matplotlib.pyplot as plt

In [None]:
#Always start with histograms!
df.hist()
plt.show()

In [None]:
#That's not pretty. Let's show just one

df['knowledge_score'].hist()
plt.show()
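
As a small preview of prettier plots, here is a sketch of a histogram with a chosen number of bins and labeled axes. The scores are made up, and the labels are just examples. `.hist()` hands back a matplotlib axes object, which we store in `ax` so we can add labels to it.

```python
import pandas
import matplotlib.pyplot as plt

# Made-up values for illustration only
toy_scores = pandas.Series([12, 15, 15, 18, 20, 21, 21, 22, 25, 30], name='knowledge_score')

ax = toy_scores.hist(bins=5)   # bins sets how many bars the range of scores is split into
ax.set_xlabel('knowledge score')
ax.set_ylabel('number of children')
ax.set_title('Distribution of knowledge scores (made-up data)')
plt.show()
```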

In [None]:
#Other options:
#Scatter plot: are math and reading scores correlated?
#note the syntax: this is the basic syntax for plotting

df.plot(kind='scatter', x = 'reading_score', y = 'math_score')
plt.show()
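
A scatter plot gives us a visual impression of a relationship. If you also want a number to go with it, one option is the Pandas `.corr()` method, which computes Pearson's correlation coefficient by default (values near 1 or -1 mean a strong linear relationship, values near 0 a weak one). The sketch below uses made-up values.

```python
import pandas

# Made-up values for illustration only
toy_df = pandas.DataFrame({
    'reading_score': [30.0, 40.0, 50.0, 60.0],
    'math_score': [28.0, 35.0, 49.0, 55.0],
})

# correlation between two columns
r = toy_df['reading_score'].corr(toy_df['math_score'])
print(round(r, 2))
```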

In [None]:
## Plot average by income
## remember our grouped object (df_grouped)
## Let's first make another dataframe from it

df_grouped_mean = df_grouped.mean()
df_grouped_mean

In [None]:
## We can plot this like we would the original dataframe!

df_grouped_mean.plot(kind='bar')
plt.show()

In [None]:
#Not great! What's the issue?

#Let's pull out three columns: notice the syntax - double brackets!

df_grouped_mean[['reading_score', 'math_score', 'knowledge_score']].plot(kind='bar')
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.2), ncol = 3)
plt.show()
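
Two things are going on in the cell above, sketched here with made-up group means. First, single brackets with one column name give back a Series, while double brackets (a list of names inside the brackets) give back a smaller dataframe. Second, one likely culprit for the unreadable bar chart is scale: income is measured in dollars, so it dwarfs the test scores when they share an axis.

```python
import pandas

# Made-up group means for illustration only
toy_means = pandas.DataFrame(
    {
        'reading_score': [35.0, 45.0, 55.0],
        'math_score': [30.0, 40.0, 50.0],
        'p2income': [25000.0, 55000.0, 120000.0],
    },
    index=pandas.Index([1, 2, 3], name='incomecat'),
)

single = toy_means['reading_score']                  # one name -> a Series
double = toy_means[['reading_score', 'math_score']]  # a list of names -> a DataFrame

print(type(single))
print(type(double))

# the largest value in each column: income is on a completely different scale
print(toy_means.max())
```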

# Exercises!

In [None]:
## Exercise 1: Slice out and print the knowledge score column from the dataframe (df)
df['knowledge_score'].head()

In [None]:
## Exercise 2: extract the third row in the dataframe (careful! remember that Python indexes start at 0)

df.iloc[2]

In [None]:
##Exercise 3: find the mean, median, and standard deviation for the knowledge score column.
# Note: I didn't teach you median, but see if you can recognize patterns and intuit how to do it.

print(df['knowledge_score'].mean())
print(df['knowledge_score'].median())
print(df['knowledge_score'].std())

In [None]:
#Exercise 4: print out the mean knowledge score separately 
# for those in income categories 1, 2, and 3 respectively.

df_grouped['knowledge_score'].mean()

In [None]:
#Exercise 5: produce a histogram of the knowledge score column

df['knowledge_score'].hist()
plt.show()

In [None]:
#Exercise 6: just based on visuals alone, is there a stronger relationship between math score and general knowledge,
#Or reading score and general knowledge?
df.plot(kind='scatter', x = 'reading_score', y = 'knowledge_score')
df.plot(kind='scatter', x = 'math_score', y = 'knowledge_score')

plt.show()

In [None]:
# Exercise 7: Do something creative! What other visualizations can you produce? What other relationships?
# Maybe produce different scatter plots from our three dataframes from the different income categories
# e.g. df_incomecat1, df_incomecat2, df_incomecat3.
# Are the relationships between variables stronger in different income categories? (visually inspected)

# If you want, check out the matplotlib documentation and see if you can do things like change colors,
# change axis names, or other features of your visualizations.

df_grouped_mean[['reading_score', 'math_score', 'knowledge_score']].T.plot(kind='bar')
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.4), ncol = 3)
plt.show()

In [None]:
df_grouped_mean.plot(kind='scatter', x = 'reading_score', y = 'knowledge_score')
df_grouped_mean.plot(kind='scatter', x = 'math_score', y = 'knowledge_score')