# The Python Numerical stack and EDA

# Table of Contents 

- Learning Goals
- Introduction to Numpy
- Introduction to Pandas
- Beginning Exploratory Data Analysis (EDA)


## Part 0: Learning Goals 
We load a dataset first as a numpy array and then as a pandas dataframe, and begin exploratory data analysis (EDA). 
By the end of this lab, you will be able to:
- Create and manipulate one-dimensional and two-dimensional numpy arrays, and pandas series and dataframes.
- Describe how to index and "type" Pandas Series and Dataframes.
- Create histograms and scatter plots for basic exploratory data analysis

In [None]:
# To draw things on the notebook instead of a separate window
%matplotlib inline 
# Import necessary libraries
import numpy as np # imports a fast numerical programming library
import scipy as sp #imports stats functions, amongst other things
import matplotlib as mpl # this actually imports matplotlib
import matplotlib.cm as cm # allows us easy access to colormaps
import matplotlib.pyplot as plt # sets up plotting under plt
import pandas as pd #lets us handle data as dataframes
# sets up pandas table display
#pd.set_option('display.width', 500)
#pd.set_option('display.max_columns', 100)
#pd.set_option('display.notebook_repr_html', True)
import seaborn as sns #sets up styles and gives us more plotting options

## Part 1: Introduction to Numpy
Scientific Python code uses a fast array structure, the numpy array, which is 0-indexed and "listy" (it behaves much like a Python list). We can compute its length, slice it, and iterate over it.

In [None]:
my_array = np.array([1, 2, 3, 4]) #create
my_array             # display
print(len(my_array)) # length
print(my_array[2:4]) # slice
for ele in my_array: # loop
    print(ele)
print(my_array.mean()) # calculate mean by method call
print(np.mean(my_array)) # calculate mean using numpy 
np.ones(10) # generates 10 floating point ones
np.ones(10, dtype='int') # generates 10 integer ones
np.zeros(10)
np.random.random(10) # uniform on [0,1]
# generate random numbers from a normal distribution with mean 0 and variance 1
normal_array = np.random.randn(1000)
print("The sample mean = %f standard deviation = %f" %(np.mean(normal_array), np.std(normal_array)))
#numpy supports vector operations 
first = np.ones(5)
second = np.ones(5)
first + second
first + 1
first*5
# if you wanted a normal distribution with mean 5 and standard deviation 7 (variance 49) you could do:
normal_5_7 = 5 + 7*normal_array
np.mean(normal_5_7), np.std(normal_5_7)

ones_2d = np.ones([3, 4]) # 3 x 4 array of ones
ones_2d.shape # show size
ones_2d.T # transpose
np.sum(normal_5_7) # sum all elements
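
The shift-and-scale step in the cell above relies on elementwise (vectorized) arithmetic. A minimal sketch, assuming a seeded generator (`rng`, not part of the lab) so the numbers are reproducible: multiplying a standard normal sample by 7 and adding 5 gives a sample with mean close to 5 and standard deviation close to 7, so the variance is close to 49, not 7.

```python
import numpy as np

# Assumption: a seeded generator, used only to make this sketch reproducible.
rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)   # sample from a standard normal

shifted = 5 + 7 * z                # elementwise: scale by 7, then shift by 5

sample_mean = shifted.mean()
sample_std = shifted.std()
sample_var = shifted.var()
print(sample_mean, sample_std, sample_var)
```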

## Part 2:  Introduction to Pandas 

Often data is stored in comma separated values (CSV) files. CSV files can be output by any spreadsheet software, and are plain text, hence are a great way to share data. 
### Importing data with numpy
Below we'll read in automobile data from a CSV file, storing the data in Python's memory first as a numpy array.  
**Read car_data_description first.**

In [None]:
arrcars = np.genfromtxt('data/mtcars.csv', delimiter=',', skip_header=1, usecols=(1,2,3,4,5,6,7,8,9,10,11))
print(arrcars.shape)
print(arrcars[0:2]) # not very nice

We need a data structure that stores column names, can easily hold variables of different types, lets us reference data by column name as well as by position, and has built-in functions for manipulating it.
Pandas is a library that does all of this! It is built on top of numpy. There are two basic pandas objects, *series* and *dataframes*, which can be thought of as enhanced versions of 1D and 2D numpy arrays, respectively. Pandas attempts to keep the efficiency that `numpy` gives us.
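
To make the "enhanced array" idea concrete, here is a small sketch with made-up values (the labels `a`, `b`, `c` and the column contents are hypothetical). A series attaches an index of labels to a 1D array, and a dataframe holds several named columns that may differ in type.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values for illustration only.
values = np.array([21.0, 22.8, 18.7])
labels = ["a", "b", "c"]

# A Series is a 1D array with an index of labels attached.
mpg = pd.Series(values, index=labels, name="mpg")

# A DataFrame holds several named columns, each with its own dtype.
frame = pd.DataFrame(
    {"mpg": values, "cyl": [6, 4, 8], "label": ["x", "y", "z"]},
    index=labels,
)

print(mpg.to_numpy())          # the underlying numpy array
print(frame.loc["b", "mpg"])   # look up by row label and column name
print(frame.dtypes)
```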
### Importing data with pandas
Now let's read in our automobile data as a pandas *dataframe* and look at the first five rows.

In [None]:
# Read in the csv files
data = pd.read_csv("data/mtcars.csv")
type(data)

In [None]:
data.head()

What we have now is a spreadsheet with indexed rows and named columns, called a *dataframe* in pandas.  `data` is an *instance* of the `pd.DataFrame` *class*, created by calling `pd.read_csv`, and it has methods (functions) belonging to it.
A pandas dataframe is a set of columns pasted together into a spreadsheet. The columns in pandas are called *series* objects.
Notice the poorly named first column, "Unnamed: 0". Pandas generated this name because the first field of the header row in the CSV file is empty:    
`"","mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"`

In [None]:
#Let's clean that up by renaming it to 'name'
data = data.rename(columns={"Unnamed: 0": "name"})
data.head()
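
The sketch below reproduces the "Unnamed: 0" behaviour on a two-row, in-memory CSV (a hypothetical excerpt, so it runs without the data file). It also shows one possible alternative to renaming, which is to pass `index_col=0` so that the unnamed column becomes the row index. The lab itself keeps the names as a regular column.

```python
import io
import pandas as pd

# Hypothetical two-row excerpt with an empty first header field.
csv_text = '"","mpg","cyl"\n"Mazda RX4",21.0,6\n"Datsun 710",22.8,4\n'

# Default behaviour: the empty header field becomes "Unnamed: 0".
default = pd.read_csv(io.StringIO(csv_text))
print(default.columns.tolist())

# Option 1 (as in the lab): rename the column.
renamed = default.rename(columns={"Unnamed: 0": "name"})

# Option 2 (an alternative, not used in the lab): use it as the row index.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed.index.tolist())
```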

In [None]:
data.columns

In [None]:
data.shape

In [None]:
#To access a *series* (column), you can use either dictionary syntax or instance-variable syntax.
data.mpg

In [None]:
#You can get a numpy array of values from the Pandas Series:
data.mpg.values

In [None]:
data['name']

In [None]:
#And we can produce a histogram from these values
plt.hist(data.mpg.values, bins=20);
plt.xlabel("mpg");
plt.ylabel("Frequency")
plt.title("Miles per Gallon");

In [None]:
# you can also get a histogram directly from pandas
data.mpg.hist(bins=20);
plt.xlabel("mpg");
plt.ylabel("Frequency")
plt.title("Miles per Gallon");

In [None]:
#Pandas supports a dictionary like access to columns. 
data['mpg']

In [None]:
# We can also get sub-dataframes by choosing a set of series.  
data[['am', 'mpg']]

### Dataframes and Series

Now that we have our automobile data loaded as a dataframe, we'd like to be able to manipulate it, its series, and its sub-dataframes, say by calculating statistics and plotting distributions of features.  Like arrays and other containers, dataframes and series are listy, so we can apply the list operations we already know to these new containers.  Below we explore our dataframe and its properties, in the context of listiness.

In [None]:
#Listiness property 1: set length
print(data.shape)     # 12 columns, each of length 32
print(len(data))      # the number of rows in the dataframe, also the length of a series
print(len(data.mpg))  # the length of a series

In [None]:
#Listiness property 2: iteration via loops
# One consequence of the column-wise construction of dataframes is that you cannot easily iterate over the rows. 
# Instead, we iterate over the columns. 
for column in data: # iterating iterates over column names though, like a dictionary
    print(column)
    
# Or we can call the attribute `columns`.
data.columns

In [None]:
# We can iterate series in the same way that we iterate lists. 
# Here we print out the number of cylinders for each of the 32 vehicles. 
for element in data.cyl:
    print(element)
# you can iterate over rows by using `itertuples`.
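
The last comment in the cell above mentions `itertuples`. A minimal sketch on a hypothetical two-row frame (`cars` is not the lab's dataframe): each row comes back as a namedtuple whose fields are the column names.

```python
import pandas as pd

# Hypothetical toy frame standing in for the car data.
cars = pd.DataFrame({"name": ["A", "B"], "cyl": [6, 4]})

rows = []
for row in cars.itertuples(index=False):
    # Each row is a namedtuple; fields are accessed by column name.
    rows.append((row.name, row.cyl))

print(rows)
```

Passing `index=False` leaves the row label out of each tuple. With the default `index=True`, the label is available as `row.Index`.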

In [None]:
#Listiness property 3: slice
print(list(data.index)) # index for the dataframe
data.cyl.index # index for the cyl series

There are two ways to index dataframes. The `loc` property indexes by label, while `iloc` indexes by integer position.  
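
With the default index 0, 1, 2, ... labels and positions coincide, which can hide the difference. The sketch below uses a hypothetical frame whose labels are 10, 20, 30, 40 to separate the two. It also shows that a `loc` slice includes its end label, while an `iloc` slice excludes its end position.

```python
import pandas as pd

# Hypothetical frame whose integer labels differ from the row positions.
df = pd.DataFrame({"mpg": [21.0, 22.8, 18.7, 14.3]}, index=[10, 20, 30, 40])

by_position = df.iloc[0:2]   # positions 0 and 1; the end position is excluded
by_label = df.loc[10:30]     # labels 10 through 30; the end label is included

print(by_position.index.tolist())
print(by_label.index.tolist())
```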

In [None]:
data.iloc[0:3]

In [None]:
data.loc[0:7]

In [None]:
data.iloc[2:5, 1:4]

In [None]:
data.loc[7:9, ['mpg', 'cyl', 'disp']]

In [None]:
#add another column named 'maker' by parsing the first column
data['maker'] = data.name.apply(lambda x: x.split()[0])
data['maker']
#data.head()

**Let's make a toy dataframe from scratch.** 
- Create a series called `column_1` with entries 0, 1, 2, 3.
- Create a second series called `column_2` with entries 4, 5, 6, 7.
- Glue these series into a dataframe called `table`, where the first and second labelled column of the dataframe are `column_1` and `column_2`, respectively.  In the dataframe, `column_1` should be indexed as `col_1` and `column_2` should be indexed as `col_2`.
- Oops!  You've changed your mind about the index labels for the columns.  Use `rename` to rename `col_1` as `Col_1` and `col_2` as `Col_2`. 
- Rename `0` as `zero`, `1` as `one`, and so on.


In [None]:
column_1 = pd.Series(range(4))
column_2 = pd.Series(range(4,8))
table = pd.DataFrame({'col_1': column_1, 'col_2': column_2})
table = table.rename(columns={"col_1": "Col_1", "col_2":"Col_2"})
table
# try this
#table = table.rename({0: "zero", 1: "one", 2: "two", 3: "three"})
#table

### Data Types

Each column (series) in a dataframe has its own type. A column may be categorical, boolean, integer, floating-point, or `object`. The last is a catch-all for strings and for anything pandas cannot infer, for example a column that contains data of mixed types.
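
A small sketch of how these types show up, using a hypothetical four-column frame. How the string column is reported depends on the pandas version (it has traditionally been `object`), so the code only prints it. The conversion to a categorical type with `astype("category")` is shown as one option, not something the lab requires.

```python
import pandas as pd

# Hypothetical toy frame with one column per kind of data.
df = pd.DataFrame({
    "cyl": [6, 4, 8],                       # integer
    "mpg": [21.0, 22.8, 18.7],              # floating-point
    "manual": [True, True, False],          # boolean
    "maker": ["Mazda", "Datsun", "Hornet"], # strings
})
print(df.dtypes)

# Explicitly convert the string column to a categorical type.
df["maker"] = df["maker"].astype("category")
categories = df["maker"].cat.categories.tolist()
print(categories)
```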

In [None]:
data.dtypes

In [None]:
# maker is a categorical variable; list its distinct values
data.maker.unique()

In [None]:
data.maker.describe()

In [None]:
av_mpg = data.groupby('maker').mpg.mean()
av_mpg

In [None]:
#query
data.mpg < 100

In [None]:
data[data.mpg < 100].head() #try other queries

In [None]:
data.query("10 <= mpg <= 50").head()
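
The three cells above filter rows in two styles: a boolean mask and a `query` string. A minimal sketch on a hypothetical one-column frame, suggesting that the two give the same rows for a range condition. Note that mask conditions are combined with `&` and need parentheses.

```python
import pandas as pd

# Hypothetical toy values for illustration only.
df = pd.DataFrame({"mpg": [21.0, 10.4, 33.9, 15.2]})

# Boolean mask: one True/False per row, combined with & (not `and`).
mask = (df["mpg"] >= 15) & (df["mpg"] <= 30)
by_mask = df[mask]

# The same condition written as a query string.
by_query = df.query("15 <= mpg <= 30")

print(by_mask)
```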

In [None]:
data.sort_values(by="mpg").head(10)

In [None]:
data[data.gear == 4]

In [None]:
data.mpg.max()
# find sum, mean etc

In [None]:
data.groupby("maker").describe()

## Part 3:  Exploratory Data Analysis (EDA) - Global Properties

Below is a basic checklist for the early stages of exploratory data analysis in Python. While not universally applicable, the rubric covers patterns that recur in many data analysis contexts, so it is useful to keep in mind when encountering a new dataset.
1. **Build** a DataFrame from the data (ideally, put all data in this object)
2. **Clean** the DataFrame. It should have the following properties:
    - Each row describes a single object
    - Each column describes a property of that object
    - Columns are numeric whenever appropriate
    - Columns contain atomic properties that cannot be further decomposed    
3. Explore **global properties**. Use histograms, scatter plots, and aggregation functions to summarize the data.
4. Explore **group properties**. Use groupby and small multiples to compare subsets of the data.
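
The four steps can be sketched end to end on a tiny, hypothetical table (the frame `raw` and its values are for illustration only, with `mpg` deliberately stored as text so there is something to clean):

```python
import pandas as pd

# 1. Build: hypothetical records standing in for a raw data source.
raw = pd.DataFrame({
    "name": ["Mazda RX4", "Datsun 710", "Merc 240D", "Merc 230"],
    "mpg": ["21.0", "22.8", "24.4", "22.8"],   # numbers stored as text
})

# 2. Clean: make numeric columns numeric and split out an atomic property.
clean = raw.assign(
    mpg=pd.to_numeric(raw["mpg"]),
    maker=raw["name"].str.split().str[0],
)

# 3. Global properties: summarize a whole column.
overall_mean = clean["mpg"].mean()

# 4. Group properties: compare subsets of the data.
mean_by_maker = clean.groupby("maker")["mpg"].mean()

print(overall_mean)
print(mean_by_maker)
```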

This process transforms the data into a format which is easier to work with, gives you a basic overview of the data's properties, and likely generates several questions for you to follow-up on in subsequent analysis.
So far we have **built** the dataframe from the automobile data and carried out very minimal **cleaning** (renaming a column). We'll now visualize global properties of our dataset, illustrating the concepts with `mpg`.
### Histograms
A histogram shows the frequency distribution of a dataset.  Below is the distribution of `mpg`.  
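
Behind the plot, a histogram is just a count of values per bin. The sketch below computes those counts with `np.histogram` on a handful of hypothetical values, without drawing anything.

```python
import numpy as np

# Hypothetical values for illustration only.
values = np.array([10.4, 14.3, 15.2, 18.7, 21.0, 21.0, 22.8, 30.4, 33.9])

# Split the range of the data into 4 equal-width bins and count per bin.
counts, edges = np.histogram(values, bins=4)

print(counts)   # number of values in each bin
print(edges)    # 5 bin edges for 4 bins
```

Every value falls into exactly one bin, so the counts add up to the number of observations.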

In [None]:
data.mpg.plot.hist()  
plt.xlabel("mpg");

In [None]:
plt.hist(data.mpg, bins=20);
plt.xlabel("mpg");
plt.ylabel("Frequency")
plt.title("Miles per Gallon");

**EXERCISE**: Plot the distribution of the rear axle ratio (`drat`).  Label the axes accordingly and give the plot a title.    Calculate the mean of the distribution.

In [None]:
data.drat.plot.hist();
plt.xlabel("drat");
plt.ylabel("Frequency");
plt.title("Rear axle ratio (drat)");
print("mean = ", data.drat.mean())

### Scatter plots
We often want to see co-variation among our columns, for example, miles/gallon versus weight.  This can be done with a scatter plot. 
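
A scatter plot shows co-variation visually. A correlation coefficient is one way to put a number on it. The sketch below uses hypothetical weight and mileage values, chosen so that mileage falls as weight rises, and computes the Pearson correlation with `Series.corr`.

```python
import pandas as pd

# Hypothetical values: heavier cars get fewer miles per gallon.
df = pd.DataFrame({
    "wt": [2.0, 2.5, 3.0, 3.5, 4.0],
    "mpg": [30.0, 26.0, 22.0, 19.0, 15.0],
})

# Pearson correlation between the two columns (between -1 and 1).
r = df["wt"].corr(df["mpg"])
print(r)
```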

In [None]:
plt.scatter(data.wt, data.mpg); # you could also use plot and plot data as dots, try that.
plt.xlabel("weight");
plt.ylabel("miles per gallon"); # plt.show() if you run your Python program from a file. 
#plt.savefig('images/foo1.pdf')

Make a new dataframe with the columns of interest, sort it based on the x-value (`wt` in this case), and plot the sorted data.
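
Sorting matters here because a line plot connects points in the order they appear, so unsorted x-values produce a zig-zag. A minimal sketch of the select-and-sort step on hypothetical values, leaving out the plotting call:

```python
import pandas as pd

# Hypothetical values for illustration only; rows are not ordered by weight.
df = pd.DataFrame({
    "wt": [3.2, 2.6, 3.4, 2.3],
    "mpg": [21.4, 21.0, 18.7, 22.8],
    "cyl": [6, 6, 8, 4],
})

# Keep only the columns of interest, then order the rows by the x-value.
ordered = df[["wt", "mpg"]].sort_values("wt")
print(ordered)
```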

In [None]:
sub_data = data[['wt', 'mpg']]
data_temp = sub_data.sort_values('wt')
plt.plot(data_temp.wt, data_temp.mpg, 'o-');
plt.xlabel("weight");
plt.ylabel("miles per gallon"); #plt.show()

**EXERCISE**: Create a scatter plot showing the co-variation between two columns of your choice. Label the axes. Comment on the scatter plot. 

In [None]:
#bar chart
av_mpg.plot(kind="barh")

In [None]:
#box plot
data.boxplot(column = 'mpg', by = 'am')

In [None]:
#pie chart
science = {
    'interest': ['Excited', 'Kind of interested', 'OK', 'Not great', 'Bored'],
    'before': [19, 25, 40, 5, 11],
    'after': [38, 30, 14, 6, 12]
}
dfscience = pd.DataFrame.from_dict(science).set_index("interest")[['before', 'after']]
fig, axs = plt.subplots(1,2, figsize = (8.5,4))
dfscience.before.plot(kind="pie", ax=axs[0], labels=None);
axs[0].legend(loc="upper left", ncol=5, labels=dfscience.index)
dfscience.after.plot(kind="pie", ax=axs[1], labels=None);