# Data Science Pandas

## Tasks Today:

1) <b>Pandas</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) Importing <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) Tabular Data Structures <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - from_dict() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) Accessing Data <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Indexing <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - df.loc <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - keys() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Slicing a DataFrame <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) Built-In Methods <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - head() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - tail() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - shape <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - describe() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - sort_values() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - .columns <br>
 &nbsp;&nbsp;&nbsp;&nbsp; e) Filtration <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Conditionals <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Subsetting <br>
 &nbsp;&nbsp;&nbsp;&nbsp; f) Column Transformations <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Generating a New Column w/Data <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - User Defined Function <br>
 &nbsp;&nbsp;&nbsp;&nbsp; g) Aggregations <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - groupby() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Type of groupby() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - mean() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - groupby() w/Multiple Columns <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - drop_duplicates() <br>

## Pandas <br>

<p>Pandas is a flexible data analysis library built on top of NumPy that is excellent for working with tabular data. It is currently the de facto standard for Python-based data analysis, and fluency in Pandas will do wonders for your productivity and, frankly, your resume. It is one of the fastest ways to get from raw data to an answer.</p>

<ul>
    <li>Pandas is a Python package whose performance-critical parts are written in Cython and C. It is a high-performance, high-level data analysis library that lets us work with large tables of data called DataFrames.</li>
    <li>A Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.).</li>
    <li>A DataFrame is like a spreadsheet: it has column headers, a row index, and so on.</li>
</ul>
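
<p>To make the Series/DataFrame distinction concrete, here is a minimal sketch. The names, ages, and cities are made up for illustration and are not part of the lesson data.</p>

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels
ages = pd.Series([27, 26, 23], index=['Alice', 'Barney', 'Marshall'], name='ages')

# A DataFrame: a collection of Series that share the same index
frame = pd.DataFrame({
    'ages': ages,
    'city': pd.Series(['NYC', 'NYC', 'St. Cloud'], index=ages.index),
})

print(ages['Barney'])          # look up a value by its label
print(type(frame['ages']))     # each column of a DataFrame is a Series
print(frame.shape)             # (rows, columns)
```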

### Importing

In [1]:
# always use the pd alias for pandas (and np for NumPy), standard for data science
import numpy as np
import pandas as pd

### Tabular data structures <br>
<p>The central object of study in Pandas is the DataFrame, a tabular data structure with rows and columns, much like an Excel spreadsheet. The first point of discussion is how to create DataFrames, both from native Python dictionaries and from text files through the Pandas I/O system.</p>

In [28]:
# define names list
names = [
    'Alice',
    'Barney',
    'Ted',
    'Marshall',
    'Lilly',
    'Robin',
    'Marvin',
    'Bob Saget',
    'Alvin'
]

# set the random seed
np.random.seed(321)

# make some random age data using np.random for ages 18 to 34 (the upper bound, 35, is exclusive)
ages = np.random.randint(18, 35, len(names))
print(ages)

# create a dictionary using the names and ages arrays
people = {
    'names' : names,
    'ages' : ages
}

print(people)

[27 26 26 23 26 19 31 29 28]
{'names': ['Alice', 'Barney', 'Ted', 'Marshall', 'Lilly', 'Robin', 'Marvin', 'Bob Saget', 'Alvin'], 'ages': array([27, 26, 26, 23, 26, 19, 31, 29, 28])}


##### from_dict()

<p>Let's convert our not-so-useful-for-analysis dict into a Pandas DataFrame. The DataFrame.from_dict() method does this in one call:</p>

In [29]:
# check type too
df = pd.DataFrame.from_dict(people)

print(df)
print(type(df))

   ages      names
0    27      Alice
1    26     Barney
2    26        Ted
3    23   Marshall
4    26      Lilly
5    19      Robin
6    31     Marvin
7    29  Bob Saget
8    28      Alvin
<class 'pandas.core.frame.DataFrame'>
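
<p>The other route mentioned above is reading text files through the Pandas I/O system. Here is a minimal sketch using pd.read_csv. The CSV content is a made-up stand-in held in a string so the example runs on its own; in practice you would pass a file path instead.</p>

```python
import io
import pandas as pd

# Hypothetical CSV content; a real workflow would pass a file path instead
csv_text = """names,ages
Alice,27
Barney,26
Ted,26
"""

# read_csv accepts a path or any file-like object
df_from_text = pd.read_csv(io.StringIO(csv_text))

print(df_from_text)
print(df_from_text.dtypes)     # column types are inferred from the text
```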


### Accessing Data <br>

##### Indexing

<p>You can directly select a column of a dataframe just like you would a dict. The result is a Pandas 'Series' object.</p>

In [30]:
# Even though they are more complex series objects,
# they still support the behavior of the underlying numpy arrays
# print(df['ages'])
# print(type(df['ages']))

# accessing an index value on a series object
df["ages"][4]

# df[4]   raises a KeyError: the first bracket is always looked up as a column header name

26

##### df.loc

<p>Along the horizontal dimension, each row of a Pandas DataFrame is returned as a Series. You will notice there is a third column present in the DataFrame: this is the $\textit{index}$. It is automatically generated as a row number, but can be replaced by a column of your choice using the DataFrame.set_index(colname) method. We can pass an index label to df.loc to access a particular $\textit{row}$:</p>

In [31]:
# get record at loc 0 and check type
record = df.loc[0]

# print(record)

# accessing a cell value by record then column name
print(df.loc[0]["names"])

Alice
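
<p>As a rough sketch of what set_index changes, the example below uses a small made-up DataFrame (df_small and by_name are illustrative names, not part of the lesson). Once a column becomes the index, .loc looks rows up by that label, while .iloc always works by position.</p>

```python
import pandas as pd

df_small = pd.DataFrame({
    'names': ['Alice', 'Barney', 'Ted'],
    'ages': [27, 26, 26],
})

# Use the 'names' column as the index instead of the default row numbers
by_name = df_small.set_index('names')

print(by_name.loc['Barney'])            # label-based: the row labeled 'Barney'
print(by_name.iloc[1])                  # position-based: the second row
print(by_name.loc['Barney', 'ages'])    # a single cell by row label and column name
```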


##### keys()

In [32]:
# A DataFrame supports dictionary-like features from Python, such as keys(), which returns the column labels
headers = df.keys()

print(headers)

for header in headers:
    print("{} \t {}".format(header, df.loc[0][header]))

Index(['ages', 'names'], dtype='object')
ages 	 27
names 	 Alice


##### Slicing a DataFrame

In [33]:
# Be aware: a slice in the first bracket selects rows, but a single int is interpreted as a column key
print(df[2:5:2])

# df[2]   raises a KeyError, since there is no column named 2

   ages  names
2    26    Ted
4    26  Lilly
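
<p>Because plain brackets treat slices and single values differently, it is often clearer to be explicit. The sketch below, on a small made-up DataFrame, shows the same slice with .iloc (positions) and a label slice with .loc.</p>

```python
import pandas as pd

df_small = pd.DataFrame({
    'names': ['Alice', 'Barney', 'Ted', 'Marshall', 'Lilly'],
    'ages': [27, 26, 26, 23, 26],
})

# Position-based: rows 2 and 4 (the stop is exclusive), same as df_small[2:5:2]
print(df_small.iloc[2:5:2])

# Label-based: both endpoints are included when slicing with .loc
print(df_small.loc[2:4])

# A single row by position, which plain df_small[2] cannot do
print(df_small.iloc[2])
```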


### Built-In Methods <br>

<p>These are methods you will use frequently to make your life easier in Pandas. It is possible to spend a whole week simply exploring the built-in functions supported by DataFrames. Here, however, we will highlight just a few that might be useful, to give you an idea of what's possible out of the box:</p>

##### head()

In [34]:
# can specify the number of rows to be displayed from the top (defaults to 5)
df.head(5)

Unnamed: 0,ages,names
0,27,Alice
1,26,Barney
2,26,Ted
3,23,Marshall
4,26,Lilly


##### tail()

In [35]:
# can specify the number of rows to be displayed from the bottom (defaults to 5)
df.tail(5)

Unnamed: 0,ages,names
4,26,Lilly
5,19,Robin
6,31,Marvin
7,29,Bob Saget
8,28,Alvin


##### shape

In [36]:
# The DataFrame has a shape property, just like a NumPy array.
print(df.shape)

# len() returns the number of rows.
print(len(df))

(9, 2)
9


##### describe() <br>

In [37]:
# Collect summary statistics in one line
df.describe()

Unnamed: 0,ages
count,9.0
mean,26.111111
std,3.480102
min,19.0
25%,26.0
50%,26.0
75%,28.0
max,31.0


##### sort_values()

In [38]:
# sort by one column here; pass a list of columns to sort on several, with left-to-right priority
# sort_values returns a new DataFrame, so we save it to df, overwriting the original
df = df.sort_values('ages')

df.head(10)

Unnamed: 0,ages,names
5,19,Robin
3,23,Marshall
1,26,Barney
2,26,Ted
4,26,Lilly
0,27,Alice
8,28,Alvin
7,29,Bob Saget
6,31,Marvin
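
<p>To illustrate sorting on several labels with left-to-right priority, here is a small sketch on made-up data. It passes a list of columns and a matching list of sort directions to sort_values.</p>

```python
import pandas as pd

df_small = pd.DataFrame({
    'names': ['Ted', 'Alice', 'Lilly', 'Robin', 'Barney'],
    'ages': [26, 27, 26, 19, 26],
})

# Sort by age (oldest first), then break ties alphabetically by name
ranked = df_small.sort_values(['ages', 'names'], ascending=[False, True])

print(ranked)
```

<p>The original row labels travel with the rows; call reset_index(drop=True) on the result if you want them renumbered from 0.</p>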


##### .columns

In [39]:
# another way to access the headers/keys
print(df.columns)

Index(['ages', 'names'], dtype='object')


### Filtration <br>
<p>Let's look at how to filter dataframes for rows that fulfill a specific condition.</p>

##### Conditionals

In [40]:
# Conditional boolean Series: one True/False value per row
can_drink = df['ages'] >= 21

print(can_drink)

5    False
3     True
1     True
2     True
4     True
0     True
8     True
7     True
6     True
Name: ages, dtype: bool


##### Subsetting

In [41]:
# Same as NumPy boolean masking: keep only the rows where the condition is True
df[df['ages'] >= 21]

Unnamed: 0,ages,names
3,23,Marshall
1,26,Barney
2,26,Ted
4,26,Lilly
0,27,Alice
8,28,Alvin
7,29,Bob Saget
6,31,Marvin
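
<p>Filters are rarely a single condition. The sketch below, on made-up data, shows one way to combine boolean masks. Note that each comparison needs its own parentheses, and that &amp;, | and ~ are used rather than Python's and, or and not.</p>

```python
import pandas as pd

df_small = pd.DataFrame({
    'names': ['Alice', 'Barney', 'Marshall', 'Robin', 'Marvin'],
    'ages': [27, 26, 23, 19, 31],
})

# Combine masks with & (and), | (or), ~ (not); wrap each comparison in parentheses
mid_twenties = df_small[(df_small['ages'] >= 23) & (df_small['ages'] < 28)]
print(mid_twenties)

# isin() tests membership in a list of values
picked = df_small[df_small['names'].isin(['Robin', 'Marvin'])]
print(picked)

# .loc takes a row mask and a column selection together
print(df_small.loc[df_small['ages'] >= 21, 'names'])
```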


### Column Transformations <br>
<p>Rarely, if ever, will the columns in the raw dataframe read from a CSV file or database table be the ones you actually need for your analysis. You will spend a lot of time transforming columns, or groups of columns, to produce new ones that are functions of the old ones. Pandas has full support for this. Consider the following dataframe containing the membership term and renewal number for a group of customers:</p>

In [44]:
# Generate some fake data
np.random.seed(321)

customer_id = np.random.randint(1000, 1100, 100)
renewal_nbr = np.random.randint(0, 10, 100)        # number of times that someone has renewed their membership

d = {
    0: 0.5,
    1: 1
}

term_in_years = [ d[key] for key in np.random.randint(0, 2, 100) ]

# print(term_in_years)

# create dictionary - generally do this in place without creating a variable
company_info = {
    'customer_id' : customer_id,
    'renewal_nbr' : renewal_nbr,
    'term_in_years' : term_in_years
}

# build a dataframe
customers = pd.DataFrame.from_dict(company_info)
customers.head(10)

Unnamed: 0,customer_id,renewal_nbr,term_in_years
0,1026,8,0.5
1,1031,3,0.5
2,1041,2,1.0
3,1072,2,0.5
4,1017,5,1.0
5,1040,7,0.5
6,1026,9,1.0
7,1088,0,0.5
8,1072,7,0.5
9,1083,4,0.5


##### Generating a New Column w/Data

In [45]:
# create new customer_tenure column for how long they've been with us
customers['customer_tenure'] = customers['renewal_nbr'] * customers['term_in_years']

customers.head(10)

Unnamed: 0,customer_id,renewal_nbr,term_in_years,customer_tenure
0,1026,8,0.5,4.0
1,1031,3,0.5,1.5
2,1041,2,1.0,2.0
3,1072,2,0.5,1.0
4,1017,5,1.0,5.0
5,1040,7,0.5,3.5
6,1026,9,1.0,9.0
7,1088,0,0.5,0.0
8,1072,7,0.5,3.5
9,1083,4,0.5,2.0


##### User Defined Function

<p>If what you want to do to a column can't be represented by simple mathematical operations, you can write your own $\textit{user defined function}$ with the full customizability available in Python and any external Python packages, then map it directly onto a column. Let's add some ages to our customer dataframe, and then classify them into our custom defined grouping scheme:</p>

In [59]:
# use .apply to map over dataframe
# create new 'ages' column using randint 16 to 69 (the upper bound, 70, is exclusive), 100 records
np.random.seed(321)
customers['ages'] = np.random.randint(16, 70, 100)

# customers.head(10)

# create own function that returns proper age group
def ageGroup(age):
    if 16 <= age < 20:
        return 'Teenager'
    elif 20 <= age < 35:
        return 'Young Adult'
    elif 35 <= age < 65:
        return 'Adult'
    else:
        # everything else falls through to here, including any age below 16
        return 'Senior'
    
    
# create new "age_group" column based on ages column with UDF mapping over
customers['age_group'] = customers['ages'].apply(ageGroup)

customers.head(10)

Unnamed: 0,customer_id,renewal_nbr,term_in_years,customer_tenure,ages,age_group
0,1026,8,0.5,4.0,68,Senior
1,1031,3,0.5,1.5,42,Adult
2,1041,2,1.0,2.0,47,Adult
3,1072,2,0.5,1.0,57,Adult
4,1017,5,1.0,5.0,24,Young Adult
5,1040,7,0.5,3.5,33,Young Adult
6,1026,9,1.0,9.0,56,Adult
7,1088,0,0.5,0.0,42,Adult
8,1072,7,0.5,3.5,40,Adult
9,1083,4,0.5,2.0,68,Senior
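
<p>For a simple binning rule like this one, Pandas also offers pd.cut, which avoids calling a Python function once per row. The sketch below is an alternative to the UDF above, not a replacement for it. The bin edges mirror the UDF's thresholds, and the upper bound of 200 is an arbitrary cap chosen for this example. Unlike the UDF, values outside the bins (for example ages below 16) come back as missing rather than 'Senior'.</p>

```python
import pandas as pd

ages = pd.Series([16, 19, 20, 34, 35, 64, 65, 69], name='ages')

# right=False makes each bin include its left edge: [16, 20), [20, 35), [35, 65), [65, 200)
# The upper bound of 200 is an arbitrary cap chosen for this sketch
age_group = pd.cut(
    ages,
    bins=[16, 20, 35, 65, 200],
    labels=['Teenager', 'Young Adult', 'Adult', 'Senior'],
    right=False,
)

print(pd.DataFrame({'ages': ages, 'age_group': age_group}))
```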


<p>As a last example, here is how you would use a UDF (user defined function) that depends on $\textit{more than one}$ column:</p>

In [60]:
# axis=1 means horizontal application: the function receives one row (record) at a time
# create a function called 'loyalGroup' that checks each record and uses ages and tenure to return 'Loyal' or 'New' + age group

def loyalGroup(record):
    age = record['ages']
    tenure = record['customer_tenure']
    
    # final result will look like 'Loyal Adult' or 'New Senior'
    group = ageGroup(age)
    
    if tenure > 2:
        return f'Loyal {group}'
    else:
        return f'New {group}'
    
    
# create a "loyal_group" column that applies on the entire customers df using axis 1 for horizontal application
customers['loyal_group'] = customers.apply(loyalGroup, axis=1)     # axis=1 passes each row; axis=0 would pass each column
customers.head(10)

Unnamed: 0,customer_id,renewal_nbr,term_in_years,customer_tenure,ages,age_group,loyal_group
0,1026,8,0.5,4.0,68,Senior,Loyal Senior
1,1031,3,0.5,1.5,42,Adult,New Adult
2,1041,2,1.0,2.0,47,Adult,New Adult
3,1072,2,0.5,1.0,57,Adult,New Adult
4,1017,5,1.0,5.0,24,Young Adult,Loyal Young Adult
5,1040,7,0.5,3.5,33,Young Adult,Loyal Young Adult
6,1026,9,1.0,9.0,56,Adult,Loyal Adult
7,1088,0,0.5,0.0,42,Adult,New Adult
8,1072,7,0.5,3.5,40,Adult,Loyal Adult
9,1083,4,0.5,2.0,68,Senior,New Senior
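
<p>Row-wise apply is flexible but runs the Python function once per row. When the rule is a simple threshold, a vectorized version is possible. The sketch below is one such approach using np.where on a small made-up frame (customers_small and prefix are illustrative names); it assumes an age_group column already exists.</p>

```python
import numpy as np
import pandas as pd

customers_small = pd.DataFrame({
    'customer_tenure': [4.0, 1.5, 2.0, 9.0],
    'age_group': ['Senior', 'Adult', 'Adult', 'Adult'],
})

# np.where picks 'Loyal' where the condition holds and 'New' elsewhere
prefix = pd.Series(
    np.where(customers_small['customer_tenure'] > 2, 'Loyal', 'New'),
    index=customers_small.index,
)

# String columns can be concatenated element by element with +
customers_small['loyal_group'] = prefix + ' ' + customers_small['age_group']

print(customers_small)
```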


### Aggregations <br>
<p>The raw data plus some transformations is generally only half the story. Your objective is to extract actual insights and actionable conclusions from the data, and that means reducing it from potentially billions of rows to some summary statistics via aggregation functions.</p>

##### groupby() <br>
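<p>The .groupby() method is in some ways the 'master' aggregation: it splits the rows into groups that share a value, so that an aggregation function can then be applied to each group.</p> 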

<p>Data tables will usually reserve one column as a primary key - that is, a column for which each row has a unique value. This is to facilitate access to the exact rows of a data table that a user wants to view. The other columns will often have repeated values, such as the age groups in the above examples. We can use these columns to explore the data using the Pandas API:</p>

In [61]:
# also introducing .count() here, which works much like COUNT with GROUP BY in SQL

customers.groupby('age_group', as_index=False).count().head(10)

Unnamed: 0,age_group,customer_id,renewal_nbr,term_in_years,customer_tenure,ages,loyal_group
0,Adult,54,54,54,54,54,54
1,Senior,9,9,9,9,9,9
2,Teenager,11,11,11,11,11,11
3,Young Adult,26,26,26,26,26,26


##### Type of groupby()

<p>The result is a new dataframe in which every column holds the number of non-null values in each group. Now notice the type of a grouped dataframe before any aggregation is applied:</p>

In [66]:
print('Type of groupby: {}'.format(type(customers.groupby("age_group", as_index=False))))

Type of groupby: <class 'pandas.core.groupby.DataFrameGroupBy'>


<p>This is because simply grouping data doesn't quite make sense without an aggregation function like count() to pair with. In this case, we're counting occurrences of the grouped field, but that's not all we can do. We can take averages, standard deviations, mins, maxes and much more! Let's see how this works a bit more:</p>

##### mean()

In [67]:
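# mean = average
# numeric_only=True skips text columns such as loyal_group, which newer pandas versions refuse to average
customers.groupby('age_group', as_index=False).mean(numeric_only=True).head(10)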

Unnamed: 0,age_group,customer_id,renewal_nbr,term_in_years,customer_tenure,ages
0,Adult,1049.074074,4.666667,0.703704,3.305556,48.481481
1,Senior,1047.111111,5.222222,0.611111,2.833333,67.0
2,Teenager,1052.727273,2.545455,0.818182,1.863636,17.272727
3,Young Adult,1032.807692,5.076923,0.711538,3.538462,26.0
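
<p>You are not limited to one aggregation at a time. As a sketch, the agg method can compute a different statistic per column in a single pass. The data and the output column names (members, avg_tenure, oldest) below are made up for illustration.</p>

```python
import pandas as pd

customers_small = pd.DataFrame({
    'age_group': ['Adult', 'Adult', 'Senior', 'Teenager', 'Adult'],
    'customer_tenure': [4.0, 2.0, 3.0, 1.0, 6.0],
    'ages': [42, 47, 68, 17, 56],
})

# Each keyword names an output column: (source column, aggregation)
summary = customers_small.groupby('age_group').agg(
    members=('ages', 'count'),
    avg_tenure=('customer_tenure', 'mean'),
    oldest=('ages', 'max'),
)

print(summary)
```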


##### groupby() w/Multiple Columns

<p>We end up with the average age of each group in the last column, the average tenure in the customer_tenure column, and so on. You can even split the groups more finely by passing a list of columns to group by:</p>

In [69]:
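# using a list to state multiple columns to groupby
# numeric_only=True skips text columns, which newer pandas versions refuse to average
customers.groupby(['age_group', 'term_in_years'], as_index=False).mean(numeric_only=True).head(10)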

Unnamed: 0,age_group,term_in_years,customer_id,renewal_nbr,customer_tenure,ages
0,Adult,0.5,1049.28125,4.59375,2.296875,48.125
1,Adult,1.0,1048.772727,4.772727,4.772727,49.0
2,Senior,0.5,1037.428571,6.142857,3.071429,66.714286
3,Senior,1.0,1081.0,2.0,2.0,68.0
4,Teenager,0.5,1051.25,3.75,1.875,16.5
5,Teenager,1.0,1053.571429,1.857143,1.857143,17.714286
6,Young Adult,0.5,1034.733333,5.333333,2.666667,26.4
7,Young Adult,1.0,1030.181818,4.727273,4.727273,25.454545


##### drop_duplicates()

<p>Drops duplicate rows from the current dataframe, keeping the first occurrence of each by default.</p>

In [75]:
# by default, rows count as duplicates only when every column matches
# subset='customer_id' instead treats rows with the same id as duplicates, so each customer appears only once

customers = customers.drop_duplicates(subset='customer_id')

len(customers)

customers = customers.sort_values('customer_id').set_index('customer_id')

customers.head(10)

customer_id,renewal_nbr,term_in_years,customer_tenure,ages,age_group,loyal_group
1000,8,0.5,4.0,26,Young Adult,Loyal Young Adult
1001,1,1.0,1.0,24,Young Adult,New Young Adult
1002,5,1.0,5.0,50,Adult,Loyal Adult
1003,1,0.5,0.5,34,Young Adult,New Young Adult
1004,4,1.0,4.0,17,Teenager,Loyal Teenager
1005,2,0.5,1.0,35,Adult,New Adult
1007,7,1.0,7.0,23,Young Adult,Loyal Young Adult
1009,3,0.5,1.5,21,Young Adult,New Young Adult
1010,0,0.5,0.0,47,Adult,New Adult
1011,9,0.5,4.5,59,Adult,Loyal Adult
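
<p>Which of the duplicated rows survives matters for the averages you compute afterwards. The sketch below, on made-up data, uses duplicated() to see what will be dropped and the keep parameter to choose between the first and last occurrence.</p>

```python
import pandas as pd

customers_small = pd.DataFrame({
    'customer_id': [1026, 1031, 1026, 1072, 1072],
    'renewal_nbr': [8, 3, 9, 2, 7],
})

# duplicated() flags every repeat of an id after its first appearance
print(customers_small.duplicated(subset='customer_id'))

# keep='first' is the default; keep='last' retains the final occurrence instead
first_seen = customers_small.drop_duplicates(subset='customer_id')
last_seen = customers_small.drop_duplicates(subset='customer_id', keep='last')

print(first_seen)
print(last_seen)
```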


<p>Thus the groupby operation allows you to rapidly make summary observations about the state of your entire dataset at flexible granularity. In one line above, we actually did something very complicated - that's the power of the dataframe. In fact, the process often consists of several iterative groupby operations, each revealing greater insight than the last - if you don't know where to start with a dataset, try a bunch of groupbys!</p>

# In-class Challenge: Analyzing the non-duplicate information <br>

<p>With the newly created customers dataframe, find out which age group has the highest tenure. Then figure out whether that group has a higher average age for 6-month memberships or year-long memberships. What was the biggest difference in averages between the duplicate dataframe and the non-duplicate dataframe?</p>

In [79]:
# question 1: which age group has the highest tenure
# numeric_only=True skips text columns, which newer pandas versions refuse to average
customers.groupby('age_group', as_index=False).mean(numeric_only=True).head(10)    # answer is young adults

# question 2: do young adults have higher average age for 6-month or 1-year memberships
customers.groupby(['age_group', 'term_in_years'], as_index=False).mean(numeric_only=True).head(10)      # answer is 6-month memberships, 28 to 24

# question 3: biggest difference in averages between the duplicate dataframe and the non-duplicate dataframe
customers.groupby('age_group', as_index=False).mean(numeric_only=True).head(10)       # teens renewal_nbr

Unnamed: 0,age_group,renewal_nbr,term_in_years,customer_tenure,ages
0,Adult,4.4,0.7,3.242857,49.0
1,Senior,4.5,0.583333,2.25,67.166667
2,Teenager,1.714286,0.857143,1.357143,17.0
3,Young Adult,4.588235,0.735294,3.441176,26.411765
