# Intro to Libraries - Numpy and Pandas
Agenda today:
- Getting started with Numpy
- Getting started with Pandas
    - select data using pandas
    - manipulate dataframes using pandas
    - aggregating in pandas
- Case Study - Exploratory data analysis

## Part I. Getting Started with Numpy

In [None]:
# starting off - let's talk about importing modules
# by convention, NumPy is imported under the alias np
import numpy as np

In [None]:
# NumPy arrays can hold strings as well as numbers, much like a normal list
names_list=['Bob','John','Sally']
names_array=np.array(['Bob','John','Sally']) # np.array works for strings as well as for numbers
print(names_list)
print(names_array)

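The __difference__ between a Python list and a NumPy array is that a list can hold a mix of data types, while an array holds elements of a single data type.

But what are the benefits of using a NumPy array instead of a base Python list?
- Speed: operations run in compiled code instead of a Python-level loop
- Broadcasting: arithmetic between an array and a scalar (or arrays of compatible shapes) is applied element by element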
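
To make the two points above concrete, here is a minimal sketch. The variable names (`mixed_list`, `mixed_array`, `prices`) are made up for illustration and are not used elsewhere in this notebook.

```python
import numpy as np

# A list keeps every element's own type side by side.
mixed_list = [1, "two", 3.0]
list_types = [type(item).__name__ for item in mixed_list]
print(list_types)

# An array stores a single dtype, so the integers are upcast to float here.
mixed_array = np.array([1, 2, 3.5])
print(mixed_array.dtype)

# Broadcasting: the scalar 2 is applied to every element without a loop.
prices = np.array([10, 20, 30])
doubled = prices * 2
print(doubled)
```

Multiplying a plain list by 2 would repeat the list instead, which is one reason arrays are the more convenient container for numeric work.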

In [None]:
import time

size_of_seq = 100000

def pure_python_version():
    # element-by-element addition in a Python-level loop
    tic = time.perf_counter()
    X = range(size_of_seq)
    Y = range(size_of_seq)
    Z = [X[i] + Y[i] for i in range(len(X))]
    toc = time.perf_counter()
    return toc - tic

def numpy_version():
    # the same addition, vectorized by NumPy
    tic = time.perf_counter()
    X = np.arange(size_of_seq)
    Y = np.arange(size_of_seq)
    Z = X + Y
    toc = time.perf_counter()
    return toc - tic


# perf_counter is a high-resolution clock, so t2 will not round down to zero
t1 = pure_python_version()
t2 = numpy_version()
print("python: " + str(t1), "numpy: " + str(t2))
print("In this example, NumPy is " + str(t1/t2) + " times faster!")

__Various NumPy Methods__

- Indexing an array

In [None]:
my_temp_array = np.random.randint(100, size = 10)
my_temp_array[1:]
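
Because the array above is random, the slices are easier to follow on fixed values. The sketch below uses a hypothetical `temps` array to show a few common indexing patterns.

```python
import numpy as np

temps = np.array([61, 64, 70, 68, 59, 73])

first = temps[0]           # a single element
tail = temps[1:]           # everything after the first element
last_two = temps[-2:]      # negative indices count from the end
every_other = temps[::2]   # start:stop:step
warm = temps[temps > 65]   # boolean mask keeps elements where the test is True

print(first, tail, last_two, every_other, warm)
```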

In [None]:
# creating a bunch of zeros 
np.zeros((10,10))


In [None]:
# take a few minutes and try out the following methods:
# np.ones()
# np.full()
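
One possible way to try the two methods named above, as a small sketch (the shapes and the fill value 7 are arbitrary choices):

```python
import numpy as np

# np.ones takes a shape and fills it with 1.0 (float64 by default).
ones = np.ones((2, 3))

# np.full takes a shape and the value to fill it with.
sevens = np.full((2, 2), 7)

print(ones)
print(sevens)
```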

__Array Math__

In [None]:
# addition 
arr1 = np.random.randint(10,size = 5)
arr2 = np.random.randint(10,size = 5)
print(arr1)
print(arr2)

In [None]:
arr1 + arr2

In [None]:
# let's see how list would handle this 


In [None]:
# dot product - numpy 


There are many more useful NumPy methods that we will use later in this course. For now, let's move on to Pandas.

## Part II. Getting Started with Pandas
<img src="attachment:Screen%20Shot%202019-04-25%20at%207.30.34%20AM.png" width="400">

In [None]:
# importing the necessary packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline

`np.random.randint(low, high, size=)` returns an array of random integers from `low` (inclusive) to `high` (exclusive), with the shape given by `size`.
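
A quick sketch to check the bounds of `np.random.randint`: with `low=80` and `high=95`, the largest value that can appear is 94. The seed is only there to make the run repeatable.

```python
import numpy as np

np.random.seed(0)  # fixed seed so repeated runs draw the same numbers
draws = np.random.randint(80, 95, size=1000)

print(draws.min(), draws.max())
```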

In [None]:
# creating a dataframe from scratch 
# syntax: pd.DataFrame()
# this first dataframe will consist of a numeric index and columns of numeric and categorical values
grades_dict = {'names':['jeremy o','jeremy c','remy','cristina'],
              'project_1':np.random.randint(80,95,4),
              'project_2':np.random.randint(75,90,4),
              'project_3':np.random.randint(85,92,4),
              'project_4':np.random.randint(88,94,4)}


In [None]:
grades_dict

In [None]:
# create a pandas dataframe using this dictionary 
grades = pd.DataFrame(grades_dict)

In [None]:
grades

In [None]:
# saving data to an Excel file (needs an Excel writer package such as openpyxl installed)
grades.to_excel('grades.xlsx')

In [None]:
# saving data to csv
grades.to_csv('grades.csv')

Two types of objects in Pandas:
- __Series__: one-dimensional labeled object capable of storing any data type
- __DataFrame__: two-dimensional labeled object whose columns can each store a different data type
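
The relationship between the two objects can be sketched as follows. The names `scores` and `frame` and their values are hypothetical.

```python
import pandas as pd

# A Series is a single labeled column of values.
scores = pd.Series([90, 85, 77], index=["ann", "bo", "cy"], name="score")

# A DataFrame is a table of columns that share one index.
frame = pd.DataFrame({"score": [90, 85, 77],
                      "passed": [True, True, False]},
                     index=["ann", "bo", "cy"])

# Pulling one column out of a DataFrame gives back a Series.
column = frame["score"]

print(scores.ndim, frame.ndim, type(column).__name__)
```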

In [None]:
# example of using strings as index

# df can store a mix of datatypes
my_df2 = pd.DataFrame({"A":[1,2,4,5],
                     "B":[7,8,9,10],
                     "C":[3,4,6,10],
                     "D":[11,10,9,19],
                     "E":['str1','STR2','stR3','sTR4'],
                     "F":[np.nan,np.nan,'No','13']},
                     index = ['foo1','foo2','foo3','foo4'])

In [None]:
my_df2

In [None]:
# exploring pandas df methods
# viewing the dataframes
grades.head()

In [None]:
grades.tail(2)

In [None]:
grades.sample(1)

In [None]:
# examine the datatypes

In [None]:
# examine the shape

In [None]:
# counting how many columns have each datatype

In [None]:
# examine the unique elements 


In [None]:
# examine the number of unique elements
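
One way the inspection steps above might look, sketched on a small hypothetical frame called `demo` with fixed values (the real `grades` frame is random, so its output changes on every run):

```python
import pandas as pd

demo = pd.DataFrame({"names": ["ann", "bo", "cy", "bo"],
                     "project_1": [90, 85, 90, 88],
                     "project_2": [80.5, 79.0, 88.0, 91.5]})

print(demo.dtypes)                  # datatype of each column
print(demo.shape)                   # (number of rows, number of columns)
print(demo.dtypes.value_counts())   # how many columns share each datatype
print(demo["names"].unique())       # distinct values, in order of appearance
print(demo["names"].nunique())      # number of distinct values
```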

In [None]:
# get summary statistics 
grades.describe()

In [None]:
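# get correlations between the numeric columns
# how do you visualize correlation? - more on this later
grades.corr(numeric_only = True) # skips the text column 'names', which pandas 2.0+ would otherwise reject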

In [None]:
# looking for missing values 
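
A minimal sketch of counting missing values, using a hypothetical frame `demo` that has a few gaps placed by hand:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"project_1": [90, np.nan, 88],
                     "project_2": [80, 79, np.nan],
                     "project_3": [85, 91, 87]})

# isnull() marks each cell True/False; summing counts the True cells per column.
missing_per_column = demo.isnull().sum()

# A single yes/no answer for the whole frame.
has_missing = bool(demo.isnull().values.any())

print(missing_per_column)
print(has_missing)
```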

In [None]:
# sorting values according to a column or multiple columns
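
Sorting by one column and by several columns could be sketched like this; `demo` and its values are made up so the resulting order is easy to verify.

```python
import pandas as pd

demo = pd.DataFrame({"names": ["ann", "bo", "cy", "di"],
                     "project_1": [90, 85, 90, 85],
                     "project_2": [80, 79, 88, 91]})

# One column, ascending by default.
by_project_2 = demo.sort_values("project_2")

# Several columns: project_1 descending, ties broken by project_2 ascending.
by_both = demo.sort_values(["project_1", "project_2"], ascending=[False, True])

print(by_project_2)
print(by_both)
```

`sort_values` returns a new frame and leaves `demo` in its original order.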


## 1. Subset and Index

In [None]:
# selecting columns

In [None]:
# subsetting data - rows
# selecting rows that fit certain criteria -- boolean indexing
# keep only the rows where a grade equals 80
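
Boolean indexing works in two steps: a comparison builds a True/False Series, and that Series then selects rows. A sketch on a hypothetical `demo` frame:

```python
import pandas as pd

demo = pd.DataFrame({"names": ["ann", "bo", "cy"],
                     "project_1": [80, 85, 80]})

# Step 1: the comparison returns one boolean per row.
mask = demo["project_1"] == 80

# Step 2: indexing with the mask keeps the rows marked True.
only_80 = demo[mask]

print(mask.tolist())
print(only_80)
```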


In [None]:
# filtering by multiple criteria
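
When combining conditions, pandas uses `&` (and), `|` (or), and `~` (not), and each condition needs its own parentheses. A sketch with made-up values:

```python
import pandas as pd

demo = pd.DataFrame({"names": ["ann", "bo", "cy"],
                     "project_1": [80, 85, 92],
                     "project_2": [60, 70, 95]})

high_p1 = demo["project_1"] >= 85
high_p2 = demo["project_2"] >= 90

both = demo[high_p1 & high_p2]      # rows meeting both conditions
either = demo[high_p1 | high_p2]    # rows meeting at least one condition
not_high_p1 = demo[~high_p1]        # rows where the first condition is False

# Written inline, the parentheses are required:
both_inline = demo[(demo["project_1"] >= 85) & (demo["project_2"] >= 90)]
```

The Python keywords `and` and `or` do not work here because they expect a single True/False value rather than a Series.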

In [None]:
# selecting rows or index by position -> iloc


In [None]:
# selecting rows or index by names -> loc
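
The difference between `iloc` and `loc` is easiest to see on a frame with string labels. In this sketch (hypothetical data), note that an `iloc` slice excludes its end position while a `loc` slice includes its end label.

```python
import pandas as pd

demo = pd.DataFrame({"project_1": [80, 85, 92],
                     "project_2": [90, 70, 95]},
                    index=["ann", "bo", "cy"])

# iloc: integer positions, end of the slice excluded (rows 0 and 1, column 0).
by_position = demo.iloc[0:2, 0]

# loc: labels, end of the slice included (rows "ann" through "bo").
by_label = demo.loc["ann":"bo", "project_1"]

# A single cell by row label and column label.
single = demo.loc["cy", "project_2"]

print(by_position.tolist(), by_label.tolist(), single)
```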

In [None]:
# selecting & slicing multiple columns
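
A few ways to pick out several columns at once, sketched on a hypothetical `demo` frame:

```python
import pandas as pd

demo = pd.DataFrame({"names": ["ann", "bo"],
                     "project_1": [80, 85],
                     "project_2": [90, 70],
                     "project_3": [88, 91]})

# A list of column names inside the brackets returns a smaller DataFrame.
subset = demo[["names", "project_2"]]

# Slicing columns by label (end label included).
label_slice = demo.loc[:, "project_1":"project_3"]

# Slicing columns by position (end position excluded).
position_slice = demo.iloc[:, 1:3]
```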


## 2. Manipulate the DataFrames 

A slight digression: the `lambda` function.
A `lambda` function, also known as an anonymous function, is useful when you only need a function once. Used together with `map()`, `filter()`, or `reduce()` (found in `functools` in Python 3), it lets you apply that anonymous function to a collection of objects.
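
As a sketch of the idea, here is a named function next to its `lambda` equivalent, followed by `map()`, `filter()`, and `reduce()` on a hypothetical list of dog ages:

```python
from functools import reduce

# A named function and the equivalent lambda.
def add_ten(x):
    return x + 10

add_ten_lambda = lambda x: x + 10

# A lambda with two arguments: returns the larger of two integers.
larger = lambda a, b: a if a > b else b

dog_ages = [1, 3, 7]

# map applies the function to every element.
human_years = list(map(lambda age: age * 7, dog_ages))

# filter keeps the elements for which the function returns True.
older_dogs = list(filter(lambda age: age > 2, dog_ages))

# reduce folds the list into one value, here a running sum.
total_age = reduce(lambda a, b: a + b, dog_ages)

print(add_ten(5), add_ten_lambda(5), larger(3, 8))
print(human_years, older_dogs, total_age)
```

In Python 3, `map()` and `filter()` return lazy iterators, which is why they are wrapped in `list()` here.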

In [None]:
# example of lambda functions
# define a function that increments by 10 and rewrite that in a lambda function

In [None]:
# lambda increment by 10

In [None]:
# define a function that compares the values of 2 integers

In [None]:
# lambda that compares the values of 2 integers

In [None]:
# using lambda with map
# create a list of dog ages
# and multiply each by 7 to get the age in human years, using both a normal function and a lambda

In [None]:
# now let's take a look at how we can apply that to the dataframe
my_df2

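Using the same logic as mapping `lambda` functions over a list, we can also apply a lambda to both Series and DataFrames:
- `apply()`: on a Series, applies a function to each value; on a DataFrame, applies it to each column (or to each row with `axis=1`)
- `applymap()`: applies a function to every individual cell of the dataframe (renamed to `DataFrame.map()` in pandas 2.1)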
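
The sketch below shows what the function receives in each case. The frame `demo` is hypothetical, and the `hasattr` check is only a convenience so the example runs on pandas versions both before and after `applymap` was renamed to `DataFrame.map`.

```python
import pandas as pd

demo = pd.DataFrame({"A": [1, 2, 4], "B": [7, 8, 9]})

# Series.apply: the lambda receives one value at a time.
a_plus_ten = demo["A"].apply(lambda x: x + 10)

# DataFrame.apply: the lambda receives a whole column (default, axis=0) ...
column_ranges = demo.apply(lambda col: col.max() - col.min())

# ... or a whole row when axis=1.
row_sums = demo.apply(lambda row: row["A"] + row["B"], axis=1)

# Element-wise over the entire frame: use DataFrame.map where it exists,
# otherwise fall back to the older applymap.
elementwise = demo.map if hasattr(pd.DataFrame, "map") else demo.applymap
squared = elementwise(lambda x: x ** 2)

print(a_plus_ten.tolist(), column_ranges.tolist(), row_sums.tolist())
print(squared)
```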

In [None]:
# adding and manipulating the dataframes
# convert column E to all lower case!


In [None]:
# dropping a column
#my_df2.drop('E', axis = 1)
# parameter: inplace

In [None]:
# more apply examples 
# get the average grades of the students
#grades['average'] = grades.mean(axis = 1, numeric_only = True) # numeric_only skips the text column 'names'


In [None]:
# create a column called "pass" whose value is "yes" if the average score is above the passing mark and "no" otherwise
#grades['pass'] = 
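
One way such a column could be built, sketched on a hypothetical frame. The threshold `PASS_MARK = 80` and the labels `"yes"`/`"no"` are assumptions made for this example, not values fixed by the notebook.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"names": ["ann", "bo", "cy"],
                     "project_1": [90, 70, 85],
                     "project_2": [80, 60, 95]})

# Row-wise mean over the numeric project columns only.
demo["average"] = demo[["project_1", "project_2"]].mean(axis=1)

PASS_MARK = 80  # assumed threshold, for illustration only

# Vectorized version: np.where(condition, value_if_true, value_if_false).
demo["pass"] = np.where(demo["average"] > PASS_MARK, "yes", "no")

# Equivalent version with apply and a lambda.
demo["pass_lambda"] = demo["average"].apply(
    lambda avg: "yes" if avg > PASS_MARK else "no")

print(demo)
```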

In [None]:
# pandas automatically broadcasts a single value to every row - which is pretty awesome
#grades['awesome?'] = 'Yes'
#grades

In [None]:
# introduce some missing values
grades.iloc[3:4,3] = None
# how does axis parameter affect dropna?
grades.isnull().sum().any()

In [None]:
#grades

In [None]:
grades.dropna(axis = 0)
# inplace parameter, axis parameter 
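
To see how `axis` changes what `dropna` removes, here is a sketch on a hypothetical frame with one missing cell:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"project_1": [90, 85, 88],
                     "project_2": [80, np.nan, 91]})

# axis=0 drops every ROW that contains a missing value.
rows_kept = demo.dropna(axis=0)

# axis=1 drops every COLUMN that contains a missing value.
cols_kept = demo.dropna(axis=1)

# Both calls return new frames; demo itself is unchanged unless the result
# is assigned back or inplace=True is passed.
print(rows_kept.shape, cols_kept.shape, demo.shape)
```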

## 3. Aggregating in Pandas - Groupby

In [None]:
# aggregating in dataframes
# create another dataframe 
new_grades = pd.DataFrame({"student":['Marc','Grace','Rene','Marc','Kenneth','Kevin','Kenneth','Grace'],
                          "project":['project_1','project_1','project_1','project_2','project_1','project_1','project_2','project_2'],
                          "grades":[80,75,89,99,89,95,97,80]})

In [None]:
new_grades.shape

In [None]:
new_grades

In [None]:
new_grades.student.nunique()

In [None]:
new_grades.project.unique()

In [None]:
# get the mean grade for each project
new_grades.groupby('project').mean(numeric_only = True) # numeric_only skips the text column 'student'

In [None]:
new_grades.groupby('student').grades.mean()

In [None]:
new_grades.groupby('student')

In [None]:
# what if I want to just return a new dataframe, grouped by some entries?
new_grades

In [None]:
# sort by student and renumber the rows; drop=True discards the old index instead of keeping it as a column
grades_by_student = new_grades.sort_values(by='student').reset_index(drop=True)


In [None]:
grades_by_student

In [None]:
# get the mean grade for each project
new_grades.groupby('project').grades.mean()
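
A `groupby` on its own only records how to split the rows; nothing is computed until an aggregation is called. The sketch below rebuilds the same data as `new_grades` under the hypothetical name `demo` and asks for several statistics at once with `agg`.

```python
import pandas as pd

demo = pd.DataFrame({
    "student": ["Marc", "Grace", "Rene", "Marc", "Kenneth", "Kevin", "Kenneth", "Grace"],
    "project": ["project_1", "project_1", "project_1", "project_2",
                "project_1", "project_1", "project_2", "project_2"],
    "grades": [80, 75, 89, 99, 89, 95, 97, 80]})

# Several aggregations per project in one call.
summary = demo.groupby("project")["grades"].agg(["mean", "min", "max", "count"])

# One aggregation per student.
per_student = demo.groupby("student")["grades"].mean()

print(summary)
print(per_student)
```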

### Working with Multiple DataFrames
- Concatenating 


In [None]:
# concatenating is joining dataframes vertically
# create some data 
grades_dict = {"student_names":['Nan','Dan','Jason','Nan','Luke','Dan','Elena','Nan','Elena','Luke'],
         "project":["proj_1","proj_1","proj_1","proj_2","proj_1","proj_2","proj_1","proj_3","proj_3","proj_3"],
         "grades":np.random.randint(80,100,10)}
grades_1 = pd.DataFrame(grades_dict)
grades_1
grades_2 = pd.DataFrame({"student_names":["Alex","Miguel","Abdul","Karen","Miguel","Abdul","Karen","Alex","Caroline"],
                                "project":["proj_1","proj_1","proj_1","proj_1","proj_2","proj_2","proj_2","proj_2","proj_1"],
                                "grades":np.random.randint(80,100,9)})
grades_2

In [None]:
two_grades = pd.concat([grades_1,grades_2])
two_grades
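
One detail worth checking after a concatenation is the index: by default each frame keeps its own row labels, so labels repeat. A sketch with two small hypothetical frames:

```python
import pandas as pd

first = pd.DataFrame({"student_names": ["Nan", "Dan"], "grades": [90, 85]})
second = pd.DataFrame({"student_names": ["Alex", "Karen"], "grades": [88, 93]})

# Default: the original row labels are kept, so 0 and 1 appear twice.
stacked = pd.concat([first, second])

# ignore_index=True renumbers the rows from 0.
renumbered = pd.concat([first, second], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```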

#### Merging
__Types of merges__: <br>
- inner merge: keeps only the keys found in both dataframes
- outer merge: keeps every key from either dataframe, filling the gaps with NaN
- left merge: keeps every key in the left dataframe
- right merge: keeps every key in the right dataframe

<img src="attachment:Screen%20Shot%202019-04-25%20at%2010.12.27%20AM.png" width="600">

In [None]:
import pandas as pd
import numpy as np
small_grades = pd.DataFrame({"students":["Rene","Kevin","Judah"],
                          "projects":[1,2,1],
                          "grades":np.random.randint(80,100,3)})
small_quiz = pd.DataFrame({"students":["Rima","Kevin","Rene"],
                            "quiz_score":np.random.randint(0,10,3)})

In [None]:
small_outer_merge = pd.merge(small_grades,small_quiz,how = 'outer')
small_outer_merge

In [None]:
small_inner_merge = pd.merge(small_grades,small_quiz, how = 'inner')
small_inner_merge

In [None]:
small_left_merge = pd.merge(small_grades,small_quiz,how = 'left')
small_left_merge

In [None]:
small_right_merge = pd.merge(small_grades,small_quiz,how = 'right')
small_right_merge
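
The four merges above can be compared side by side on fixed values. This sketch uses hypothetical frames `grades_demo` and `quiz_demo`, names the join column explicitly with `on=`, and uses `indicator=True` to label where each row came from.

```python
import pandas as pd

grades_demo = pd.DataFrame({"students": ["Rene", "Kevin", "Judah"],
                            "grades": [90, 85, 88]})
quiz_demo = pd.DataFrame({"students": ["Rima", "Kevin", "Rene"],
                          "quiz_score": [7, 9, 4]})

# Run all four merge types on the same key column.
merged = {how: pd.merge(grades_demo, quiz_demo, on="students", how=how)
          for how in ["inner", "outer", "left", "right"]}
row_counts = {how: len(frame) for how, frame in merged.items()}
print(row_counts)

# indicator=True adds a _merge column: both, left_only, or right_only.
flagged = pd.merge(grades_demo, quiz_demo, on="students",
                   how="outer", indicator=True)
print(flagged)
```

Only Rene and Kevin appear in both frames, so the inner merge has two rows, while Judah (grades only) and Rima (quiz only) show up with NaN in the column they are missing from.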

## 4. Case Study - Exploratory Data Analysis
We will use the Adult census dataset to perform exploratory data analysis. For more details on this dataset, see [this report](http://cseweb.ucsd.edu/classes/sp15/cse190-c/reports/sp15/048.pdf).

In [None]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
adults = pd.read_csv(url,header=None)

In [None]:
adults.sample(5)

In [None]:
adults.head(5)

In [None]:
columns = ['age','work_class','fnlwgt','education','education_num','marital_status','occupation','relationship','race'
           ,'sex','capital_gain','capital_loss','hours_per_week','native_country','income']

In [None]:
adults.columns = columns

In [None]:
adults.head(3)

In [None]:
# examine the datatypes
adults.dtypes

In [None]:
# get datatypes counts
adults.dtypes.value_counts()

In [None]:
# filtering certain conditions
adults[(adults.education == 'Bachelors') & (adults.sex == 'Female')]
# why is the result empty?

In [None]:
# examine some cells
adults.iloc[2,5]
# it turns out that the cell contains leading whitespace -- that's why the condition matched nothing. We need to strip it first.

In [None]:
# strip whitespace -- using lambda function to remove whitespace for string objects
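
A possible way to do the stripping, sketched on a three-row stand-in for the file so that it runs without a download. The text in `raw_text` and the frame name `people` are made up; the leading spaces after each comma mimic the problem seen above.

```python
import io
import pandas as pd

raw_text = "39, Bachelors, Male\n28, Bachelors, Female\n37, Masters, Female\n"
people = pd.read_csv(io.StringIO(raw_text), header=None,
                     names=["age", "education", "sex"])

# Before stripping, ' Bachelors' != 'Bachelors', so nothing matches.
before = len(people[(people.education == "Bachelors") & (people.sex == "Female")])

# Strip strings and leave every other value untouched.
strip_if_text = lambda value: value.strip() if isinstance(value, str) else value
for column in people.columns:
    people[column] = people[column].apply(strip_if_text)

after = len(people[(people.education == "Bachelors") & (people.sex == "Female")])
print(before, after)

# Alternative: ask read_csv to skip the spaces after each delimiter.
clean = pd.read_csv(io.StringIO(raw_text), header=None,
                    names=["age", "education", "sex"], skipinitialspace=True)
```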


In [None]:
adults[(adults.education == 'Bachelors') & (adults.sex == 'Female')]


In [None]:
# sorting values
adults.sort_values(['fnlwgt'],ascending = False)

In [None]:
# get missing values
adults.isnull().sum().any()

In [None]:
adults.occupation.value_counts()

In [None]:
adults.marital_status.value_counts()

In [None]:
# visualize the results
adults.marital_status.value_counts().plot(kind = 'bar')

In [None]:
adults.age.describe()

In [None]:
adults.describe()

In [None]:
plt.hist(adults['age'],bins = 40, color = 'pink')
plt.title('Histogram for Age Distribution')

In [None]:
# Since income is the thing we want to predict, let's create a variable called
# income_binary that is 1 if someone earns more than 50K and 0 otherwise
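
A sketch of one way to build such a column. It assumes the income labels are the strings `>50K` and `<=50K`, possibly with surrounding whitespace; the small `incomes` frame is hypothetical.

```python
import pandas as pd

incomes = pd.DataFrame({"income": [" <=50K", " >50K", " <=50K", " >50K", " >50K"]})

# The comparison gives True/False; astype(int) turns that into 1/0.
incomes["income_binary"] = (incomes["income"].str.strip() == ">50K").astype(int)

counts = incomes["income_binary"].value_counts()
print(incomes)
print(counts)
```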



In [None]:
#adults.income_binary.value_counts().plot(kind = 'bar')
#plt.title('Income')

In [None]:
adults.education.value_counts().plot(kind = 'bar')

In [None]:
# examine correlations
adults.corr(numeric_only = True) # correlations among the numeric columns only

In [None]:
# summary statistics
adults.describe()

In [None]:
adults.groupby('marital_status').income_binary.value_counts().unstack('income_binary')

In [None]:
grouped_by_marital = adults.groupby('marital_status').income_binary.value_counts().unstack('income_binary')


In [None]:
# plot a stacked barplot 
grouped_by_marital.columns
grouped_by_marital[[0,1]].plot(kind = 'bar',stacked = True)
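
To see what the `groupby(...).value_counts().unstack()` chain produces before plotting it, here is a sketch on six hypothetical rows. `pd.crosstab` is shown as an alternative that builds the same kind of table in one call.

```python
import pandas as pd

demo = pd.DataFrame({
    "marital_status": ["Married", "Married", "Single", "Single", "Single", "Married"],
    "income_binary": [1, 0, 0, 0, 1, 1]})

# Step 1: counts of each income value within each marital status (a long Series).
counts = demo.groupby("marital_status")["income_binary"].value_counts()

# Step 2: unstack moves the inner index level into columns (a wide table).
table = counts.unstack(fill_value=0)

# The same table built directly.
cross = pd.crosstab(demo["marital_status"], demo["income_binary"])

print(table)
```

Each row of `table` is one bar in the stacked plot, and each column is one colored segment of that bar.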

In [None]:
# trying more groupby stuff 
grouped_by_gender = adults.groupby('sex').income_binary.value_counts().unstack()

In [None]:
grouped_by_gender

In [None]:
grouped_by_gender.plot(kind = 'bar',stacked = True)

__Conclusion__:<br>
With the help of Pandas, we can manipulate, subset, and extract insights from tabular data with ease. It lets us answer questions such as "What is the average age of women who earn more than 50K?" and "What is the most common occupation among people earning more than 50K?". Pandas allows us to perform exploratory data analysis and detect patterns in our data. Tomorrow, we will take a deep dive into advanced tabular data manipulation in pandas and working with multiple tables.
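
As a sketch of how the two example questions might be answered, here is a five-row hypothetical frame named `people`. The values are invented and say nothing about the real dataset.

```python
import pandas as pd

people = pd.DataFrame({
    "age": [30, 40, 50, 60, 35],
    "sex": ["Female", "Female", "Male", "Female", "Male"],
    "occupation": ["Sales", "Tech", "Tech", "Tech", "Sales"],
    "income_binary": [1, 1, 1, 0, 0]})

high_income = people["income_binary"] == 1

# Average age of women in the higher income group.
avg_age_women = people[(people["sex"] == "Female") & high_income]["age"].mean()

# Most common occupation in the higher income group.
top_occupation = people[high_income]["occupation"].value_counts().idxmax()

print(avg_age_women, top_occupation)
```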

Additional Resources:
- [Pandas Cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
- [10 Minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#min)
- [Pandas Visualization](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html#visualization)
- [Data Analysis with Pandas by Kevin Markham](https://www.youtube.com/watch?v=yzIMircGU5I&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y)