# Module 4 - NumPy Wrap-up and Pandas

## Topic 1: NumPy in 3D!

### Last week we did a bunch of NumPy 2D array work. NumPy can be used in 3D and beyond. To demonstrate, I created an (im)practical example using a 3D array.

### In this example, we load some simple ASCII text art into a 3D array so we can output it based on a user's input.

In [None]:
import numpy as np

# 26 letters, each drawn on an 8x8 grid of single characters
my_text_npy = np.empty((26, 8, 8), dtype=str)

with open("all letters in ascii.txt", "r") as ascii_art:
    for line_number, line in enumerate(ascii_art):
        # Every 8 lines of the file make up one letter, so the quotient is
        # the letter's position along axis 0 and the remainder is its row
        array_depth, line_counter = divmod(line_number, 8)
        for char_counter, char in enumerate(line.rstrip()):
            my_text_npy[array_depth, line_counter, char_counter] = char

# Print each 8x8 letter in turn
for count in range(my_text_npy.shape[0]):
    print(my_text_npy[count, :, :])
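
#### If the three indexes feel abstract, here's a minimal sketch on a much smaller array. The two 3x3 "letters" below are made up for illustration and are not from the ASCII art file.

```python
import numpy as np

# Two hypothetical 3x3 "letters" stacked along axis 0
tiny_art = np.array([
    [list("###"), list("# #"), list("###")],   # a box
    [list(" # "), list("###"), list(" # ")],   # a plus sign
])

print(tiny_art.shape)      # (depth, rows, columns)
print(tiny_art[1])         # the whole second letter, a 3x3 slice
print(tiny_art[1, 0])      # the first row of the second letter
print(tiny_art[1, 0, 1])   # a single character
```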

### With our 26x8x8 array loaded with awesome ASCII art, we can access each letter simply by its position along axis 0

In [None]:
import string

# Map each lowercase letter to its position along axis 0 (a -> 0, ..., z -> 25)
dictionary_index = {letter: position for position, letter in enumerate(string.ascii_lowercase)}
user_input = input("Give me a name: ")
for letter in user_input:
    # Skip spaces, digits, and anything else we have no art for
    if letter.lower() in dictionary_index:
        print(my_text_npy[dictionary_index[letter.lower()], :, :])
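
#### As a possible next step, here's a rough sketch that pulls several letters out at once and joins them side by side. The stand_in array is a hypothetical substitute for my_text_npy (each 8x8 grid is just filled with its own letter) so the cell runs without the art file.

```python
import numpy as np

# Hypothetical stand-in for my_text_npy: 26 letters on an 8x8 grid,
# where each grid is simply filled with its own letter
stand_in = np.empty((26, 8, 8), dtype=str)
for position in range(26):
    stand_in[position] = chr(ord("a") + position)

name = "Ada"
# ord() gives the same a=0 ... z=25 positions as a lookup dictionary would
positions = [ord(letter) - ord("a") for letter in name.lower() if "a" <= letter <= "z"]

# Indexing axis 0 with a list pulls several letters at once
letters = stand_in[positions]

# Join the letters along the column axis so the name reads left to right
banner = np.concatenate(letters, axis=1)
for row in banner:
    print("".join(row))
```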

## Topic 2: Pandas Dataframes

### Now that we've seen NumPy in action, we can start learning a bit about Pandas, which builds upon the NumPy library.  

### Stolen definition: A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).

### Let's use a simple dataset to learn how to do some work in Pandas dataframes

In [None]:
import pandas as pd

#### We can import this file, define the separator, and assign an index column

In [None]:
user_data = pd.read_csv("data_for_pandas.txt",sep='|',index_col='user_id')
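
#### If you don't have the file handy, here's a minimal sketch of the same read_csv call run against a few made-up rows held in a string. The rows and the column layout are my assumptions, not the contents of data_for_pandas.txt.

```python
import io
import pandas as pd

# Hypothetical pipe-delimited text standing in for the real file
raw_text = """user_id|age|occupation|zip_code
1|24|technician|85711
2|53|other|94043
3|23|writer|32067
"""

sample = pd.read_csv(io.StringIO(raw_text), sep="|", index_col="user_id")
print(sample)
print(sample.index.name)      # user_id is now the row label, not a column
print(list(sample.columns))
```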

#### Let's have a quick look at some of the rows

In [None]:
print(user_data.head(5))

In [None]:
print(user_data.tail(7))

#### We can also observe the datatypes and shape of the frame

In [None]:
print(user_data.dtypes)

In [None]:
print(user_data.shape[0])
print(user_data.shape[1])
print(user_data.columns)

In [None]:
print(user_data.index)

#### I can also look at information about specific columns

In [None]:
print(user_data.occupation.head(10))
print(user_data["occupation"].head(5))

In [None]:
print(user_data.occupation.value_counts())
print(user_data.occupation.value_counts().count()) #This line and the next are equivalent
print(user_data.occupation.nunique())
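
#### Here's a small sketch of why those two lines agree, using a made-up list of occupations. Both skip missing values by default, which is the part that is easy to forget.

```python
import pandas as pd

# Hypothetical occupations, including one missing value
occupations = pd.Series(["writer", "student", "writer", None, "engineer", "student", "writer"])

counts = occupations.value_counts()   # missing values are left out by default
print(counts)
print(counts.count())                      # number of distinct non-missing values
print(occupations.nunique())               # same number, computed directly
print(occupations.nunique(dropna=False))   # counts the missing value as one more
```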

In [None]:
print(user_data.occupation.value_counts().head(5))

#### describe() is great for seeing summary statistics for the whole frame

In [None]:
user_data.describe()

In [None]:
user_data.describe(include = "all")
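
#### To see what include = "all" changes, here's a quick sketch on a made-up four-row frame. By default describe() only summarizes the numeric columns; with include="all" the text columns get unique, top, and freq rows as well.

```python
import pandas as pd

# Hypothetical data with one numeric and one text column
people = pd.DataFrame({
    "age": [24, 53, 23, 24],
    "occupation": ["technician", "other", "writer", "technician"],
})

numeric_summary = people.describe()            # numeric columns only
full_summary = people.describe(include="all")  # every column

print(numeric_summary)
print(full_summary)
```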

#### It's also easy to look at something simple like the mean value of a column

In [None]:
print(round(user_data.age.mean(),3))

## Topic 3: Pandas Dataframes Continued

### This topic gives a short intro to Pandas Series and shows how to merge dataframes

### Let's enrich our original dataset with some zip code data to get a more valuable set of data

In [None]:
all_zips = pd.read_csv("us_zip_code_city_county_state.csv",sep=",",index_col="zip_code")
print(all_zips.head())
print(all_zips.dtypes)
print(all_zips.index.dtype) #zip_code is the index, so its datatype is not listed in dtypes

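#### If you look at the datatype output, you can see that the zip code field in all_zips is int64. Because zip_code is the index of all_zips, its datatype is reported on the index rather than in the dtypes listing. This doesn't match the zip_code field in user_data, which has a type of "object". We need to correct this before we can use zip_code as the key to merge the two dataframes.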

In [None]:
#errors='coerce' turns any zip code that can't be parsed as a number into NaN
user_data["zip_code"] = pd.to_numeric(user_data["zip_code"],errors='coerce')
print(user_data.dtypes)
print(user_data.head(25))
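
#### Here's a small sketch of what errors='coerce' does, using four made-up zip code strings. Note two side effects: the unparseable value becomes NaN (which makes the result a float column), and a leading zero is lost once the text becomes a number.

```python
import pandas as pd

# Hypothetical zip codes stored as text, like a column read in as "object"
zips = pd.Series(["85711", "94043", "V3N4P", "02139"], dtype=object)

converted = pd.to_numeric(zips, errors="coerce")
print(converted)
print(converted.dtype)          # a float type, because NaN is a float
print(converted.isna().sum())   # how many values could not be converted
```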

In [None]:
all_zips.describe(include = "all")

#### Now that both zip code fields are numeric, we can do a SQL-style left join to combine the enriching zip code data with our dataset.

In [None]:
merged_df = user_data.merge(all_zips, how = 'left', on = ['zip_code'])
print(merged_df.head(5))
print(merged_df.shape[0])
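
#### To make the left join concrete, here's a minimal sketch on two tiny made-up frames. I've added indicator=True, which is not used in the cell above, because it adds a _merge column showing which rows found a match.

```python
import pandas as pd

# Hypothetical users; the last zip code has no entry in the lookup table
users = pd.DataFrame({"user_id": [1, 2, 3], "zip_code": [85711, 94043, 99999]})
zip_lookup = pd.DataFrame({"zip_code": [85711, 94043], "state": ["AZ", "CA"]})

# A left join keeps every row of users, matched or not
merged = users.merge(zip_lookup, how="left", on="zip_code", indicator=True)
print(merged)
```

#### Rows marked left_only exist in users but not in zip_lookup, so their state is NaN.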

In [None]:
print(merged_df.state.nunique())
print(merged_df.state.value_counts())

#### Unfortunately, all is not well, and the problem doesn't immediately stand out. Some rows in our data are non-conforming, so the merge didn't find a match in every case. We can see the bad rows by finding any row that has a NaN in it.

In [None]:
bad_values = merged_df[merged_df.isna().any(axis=1)]
print(bad_values)
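
#### Broken into steps, the filter above looks like this. The three-row frame is made up, and dropna() is shown only as one possible way to keep the clean rows.

```python
import numpy as np
import pandas as pd

# Hypothetical merged data where the second row failed to match
frame = pd.DataFrame({
    "age": [24, 53, 23],
    "zip_code": [85711.0, np.nan, 32067.0],
    "state": ["AZ", None, "FL"],
})

cell_is_nan = frame.isna()              # True/False for every cell
row_has_nan = cell_is_nan.any(axis=1)   # True if any cell in the row is NaN
bad_rows = frame[row_has_nan]
clean_rows = frame.dropna()             # the complement: rows with no NaN

print(bad_rows)
print(clean_rows)
```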

## Topic 4: Pandas Statistics

### This topic introduces some simple statistics that can be done on our dataframes. Like NumPy, Pandas offers a ton of ways to do statistical analysis; I'm going to introduce some here and continue next week with some more in-depth analysis.

### For this example, I'm going to have a look at some red wine data

In [None]:
wine_data = pd.read_csv('Red Wine Quality.csv')
print(wine_data.head()) #a cell only displays its last expression automatically, so print both
print(wine_data.shape)

#### I can drop a column from this dataset that I don't want.  I'm going to get rid of the chlorides column based on its index position

In [None]:
#position 5 is the sixth column; check wine_data.columns to confirm that is chlorides in your file
wine_data = wine_data.drop(wine_data.columns[[5]],axis=1) #drop chlorides column
wine_data.head()
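
#### Dropping by position depends on the column order, so here's a sketch of the alternative: dropping by name. The two-row frame is made up and only borrows a few column names from the wine data.

```python
import pandas as pd

# Hypothetical slice of wine data
wine_sample = pd.DataFrame({
    "pH": [3.51, 3.20],
    "chlorides": [0.076, 0.098],
    "alcohol": [9.4, 9.8],
})

by_position = wine_sample.drop(wine_sample.columns[[1]], axis=1)  # second column
by_name = wine_sample.drop(columns=["chlorides"])                 # same column, by label

print(by_position.equals(by_name))
print(list(wine_sample.columns))   # drop returns a new frame; the original is untouched
```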

#### I can look at slices of columns and rows, and even assign them to a new dataframe

In [None]:
wine_data.loc[:,["quality","pH","alcohol"]]
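
#### Here's a short sketch of slicing rows as well as columns and keeping the result. The values are made up. One thing to watch: loc slices by label and includes the end label, while iloc slices by position and excludes the end.

```python
import pandas as pd

# Hypothetical wine rows
wine_sample = pd.DataFrame({
    "quality": [5, 5, 6, 7],
    "pH": [3.51, 3.20, 3.26, 3.16],
    "alcohol": [9.4, 9.8, 9.8, 10.0],
    "sulphates": [0.56, 0.68, 0.65, 0.58],
})

subset = wine_sample.loc[:, ["quality", "pH", "alcohol"]]    # all rows, three columns
by_label = wine_sample.loc[0:2, ["quality", "alcohol"]]      # labels 0, 1 and 2
by_position = wine_sample.iloc[0:2, 0:2]                     # positions 0 and 1 only

print(subset.shape, by_label.shape, by_position.shape)
```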

#### I can look at standard stats about whatever I want

In [None]:
print(wine_data["alcohol"].mean())
print(wine_data["quality"].std())
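
#### One thing worth knowing when comparing with last week's NumPy work: Pandas std() divides by n-1 by default (sample standard deviation), while np.std() divides by n. A quick sketch on made-up numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen so the population standard deviation is exactly 2
values = [2, 4, 4, 4, 5, 5, 7, 9]
series = pd.Series(values)

pandas_std = series.std()          # ddof=1 by default
numpy_std = np.std(values)         # ddof=0 by default
matched = series.std(ddof=0)       # ask Pandas for the NumPy behaviour

print(pandas_std, numpy_std, matched)
```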

#### I can do arithmetic on the whole frame or on specific columns/rows. I'm going to put the wine quality on a 100-point scale instead of a 10-point scale by multiplying it by 10

In [None]:
wine_data["quality"]=wine_data["quality"]*10
wine_data.head()

#### Because I'm only interested in high quality wine, I'm going to keep just the above-average wine rows. I'll create a new dataframe to hold this data.

#### First I need to create a data series showing which rows have higher than average quality

In [None]:
high_quality_wine_bool = wine_data["quality"] > wine_data["quality"].mean()
print(high_quality_wine_bool)

#### I can now use this boolean series to create a dataframe that only contains the rows where the series is "True"

In [None]:
high_quality_wine = wine_data[high_quality_wine_bool]
high_quality_wine.describe()
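
#### Here's the same boolean filtering on a made-up five-row frame, so you can check the result by eye. The last line is an extra: conditions can be combined with & as long as each one is wrapped in parentheses.

```python
import pandas as pd

# Hypothetical wine rows, quality already on the 100-point scale
wine_sample = pd.DataFrame({
    "quality": [50, 50, 60, 70, 50],
    "alcohol": [9.4, 9.8, 9.8, 10.0, 9.4],
})

# One True/False per row: is this wine above the average quality?
above_average = wine_sample["quality"] > wine_sample["quality"].mean()
best = wine_sample[above_average]

# Combine two conditions: above average AND at least 10.0 alcohol
best_and_strong = wine_sample[above_average & (wine_sample["alcohol"] >= 10.0)]

print(best)
print(best_and_strong)
```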

## Topic 5: Pandas Time Series

### Pandas is used quite often for time series. In the interest of time, we won't dig into this specifically, but I've included a few links in Blackboard if you want to learn more about it. In appropriate cases, it can be quite powerful.