# Coding Temple's Data Analytics Course
---
## Advanced Python Day 4: Intro to Pandas

## Tasks Today:

0) <b>Pre-Work</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) Numpy Random Sampling

1) <b>Pandas</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) Importing <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) Tabular Data Structures <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - from_dict() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - read_csv() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) <b>In-Class Exercise #1</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) Accessing Data <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Indexing <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - df.loc <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - keys() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Slicing a DataFrame <br>
 &nbsp;&nbsp;&nbsp;&nbsp; e) Built-In Methods <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - head() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - tail() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - shape <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - describe() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - sort_values() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - .columns <br>
 &nbsp;&nbsp;&nbsp;&nbsp; f) <b>In-Class Exercise #2</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; g) Filtration <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Conditionals <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Subsetting <br>
 &nbsp;&nbsp;&nbsp;&nbsp; h) Column Transformations <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Generating a New Column w/Data <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - User Defined Function <br>
 &nbsp;&nbsp;&nbsp;&nbsp; i) Aggregations <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - groupby() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Type of groupby() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - mean() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - groupby() w/Multiple Columns <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - drop_duplicates() <br>

## Numpy Random Sampling

In [3]:
import numpy as np

np.random.seed(42) # Fix the seed so the same random numbers are generated on every run. Allows for reproducibility

# A single call generates a single random number
print(f'Here is a random number: {np.random.uniform()}')

# You can also pass in some bounds(parameters)
print(f'Here is a random number between 1 and 1 million: {np.random.uniform(1,1e6)}')

# You can also generate a bunch of random numbers at one time
print(f'Here is a 3x3 Matrix with random numbers between 1 and 1 million: \n{np.random.uniform(1,1e6, (3,3))}')

# Instead of floats, let's generate some random integer values
print(f"Here are some random integers: {np.random.randint(0,10,4)}")

Here is a random number: 0.3745401188473625
Here is a random number between 1 and 1 million: 950714.3556956097
Here is a 3x3 Matrix with random numbers between 1 and 1 million: 
[[731994.20981746 598658.88553855 156019.4844238 ]
 [155995.36434168  58084.55408459 866176.27959879]
 [601115.4106282  708072.86972347  20585.47371131]]
Here are some random integers: [1 7 5 1]
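
To make the idea of seeding concrete, here is a minimal sketch (the variable names are just for illustration). It shows that re-seeding replays the same draws, and that the upper bound passed to `randint` is exclusive:

```python
import numpy as np

# Re-seeding the global generator replays the same sequence of draws
np.random.seed(42)
first_run = np.random.uniform(1, 1e6, 3)

np.random.seed(42)
second_run = np.random.uniform(1, 1e6, 3)

print(np.array_equal(first_run, second_run))  # True

# randint's upper bound is exclusive, so these values fall between 0 and 9
draws = np.random.randint(0, 10, 1000)
print(draws.min(), draws.max())
```

If you remove the second call to `np.random.seed(42)`, the two arrays will almost certainly differ.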


## Pandas <br>

<p>Pandas is a flexible data analysis library built on top of NumPy that is excellent for working with tabular data. It is currently the de-facto standard for Python-based data analysis, and fluency in Pandas will do wonders for your productivity and frankly your resume. It is one of the fastest ways of getting from zero to answer in existence. </p>

<ul>
    <li>Pandas is a Python library whose performance-critical parts are written in Cython and C. It is a fast, high-level data analysis library that lets us work with tabular datasets called DataFrames.</li>
    <li>Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.)</li>
    <li>Dataframe = Spreadsheet (has column headers, index, etc.)</li>
</ul>
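
As a quick illustration of the two structures described above, here is a small sketch with made-up values (`scores` and `table` are hypothetical names, not part of the lesson data):

```python
import pandas as pd

# A Series is a 1-D labeled array; here the labels are strings
scores = pd.Series([90, 85, 77], index=['a', 'b', 'c'], name='score')
print(scores['b'])  # 85

# A DataFrame is a table; each of its columns is itself a Series
table = pd.DataFrame({'score': [90, 85, 77], 'passed': [True, True, False]})
print(type(table['score']))
print(table.shape)  # (3, 2)
```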

### Importing

In [4]:
# Pandas is aliased as pd across the board. This is the industry standard
import pandas as pd

### Tabular data structures <br>
<p>The central object of study in Pandas is the DataFrame, which is a tabular data structure with rows and columns, like an Excel spreadsheet. The first point of discussion is the creation of DataFrames, both from native Python dictionaries and from text files through the Pandas I/O system.</p>

In [4]:
names = [
    'Alice', 'Bob',
    'James', 'Beth',
    'John', 'Sally',
    'Richard', 'Lauren',
    'Brandon', 'Sabrina'
]
ages = np.random.randint(18,35,len(names))

my_dict = {
    'names' : names,
    'ages' : ages
}
my_dict

{'names': ['Alice',
  'Bob',
  'James',
  'Beth',
  'John',
  'Sally',
  'Richard',
  'Lauren',
  'Brandon',
  'Sabrina'],
 'ages': array([18, 29, 29, 34, 27, 33, 32, 32, 29, 20])}

##### from_dict()

<p>Let's convert our not-so-useful-for-analysis dict into a Pandas dataframe. We can use the from_dict function to do this easily using Pandas:</p>

In [15]:
# df is another industry standard, it is short for dataframe. When I create a dataframe, you will see me use this QUITE often
df = pd.DataFrame.from_dict(my_dict)
df

Unnamed: 0,names,ages
0,Alice,18
1,Bob,29
2,James,29
3,Beth,34
4,John,27
5,Sally,33
6,Richard,32
7,Lauren,32
8,Brandon,29
9,Sabrina,20
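
The dictionary above maps column names to lists of values. As a hedged side note, `from_dict` can also read a dictionary whose keys are row labels, using the `orient='index'` argument. The data below is invented for illustration:

```python
import pandas as pd

# Hypothetical data: each key is a row label, each value is one row
rows = {'Alice': [18, 'NY'], 'Bob': [29, 'CA']}

by_index = pd.DataFrame.from_dict(rows, orient='index', columns=['age', 'state'])
print(by_index)

# With the default orient='columns', each key becomes a column instead
by_columns = pd.DataFrame.from_dict({'age': [18, 29], 'state': ['NY', 'CA']})
print(by_columns.shape)  # (2, 2)
```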


##### read_csv()

In [7]:
# You can also bring in textual data and create a dataframe(table) out of it
df1 = pd.read_csv(r'C:\Users\Alex Lucchesi\coding-temple\coding_temple_data_analytics_ft\week-3\student_notebooks\boston_marathon_2017.csv')
df1

Unnamed: 0,10K,15K,20K,25K,30K,35K,40K,5K,Age,Bib,...,Division,Gender,Half,M/F,Name,Number of Records,Official Time,Overall,Pace,State
0,12/30/1899 12:30:28 AM,12/30/1899 12:45:44 AM,12/30/1899 1:01:15 AM,12/30/1899 1:16:59 AM,12/30/1899 1:33:01 AM,12/30/1899 1:48:19 AM,12/30/1899 2:02:53 AM,12/30/1899 12:15:25 AM,24,11.0,...,1,1,12/30/1899 1:04:35 AM,M,"Kirui, Geoffrey",1,12/30/1899 2:09:37 AM,1,12/30/1899 12:04:57 AM,
1,12/30/1899 12:30:27 AM,12/30/1899 12:45:44 AM,12/30/1899 1:01:15 AM,12/30/1899 1:16:59 AM,12/30/1899 1:33:01 AM,12/30/1899 1:48:19 AM,12/30/1899 2:03:14 AM,12/30/1899 12:15:24 AM,30,17.0,...,2,2,12/30/1899 1:04:35 AM,M,"Rupp, Galen",1,12/30/1899 2:09:58 AM,2,12/30/1899 12:04:58 AM,OR
2,12/30/1899 12:30:29 AM,12/30/1899 12:45:44 AM,12/30/1899 1:01:16 AM,12/30/1899 1:17:00 AM,12/30/1899 1:33:01 AM,12/30/1899 1:48:31 AM,12/30/1899 2:03:38 AM,12/30/1899 12:15:25 AM,25,23.0,...,3,3,12/30/1899 1:04:36 AM,M,"Osako, Suguru",1,12/30/1899 2:10:28 AM,3,12/30/1899 12:04:59 AM,
3,12/30/1899 12:30:29 AM,12/30/1899 12:45:44 AM,12/30/1899 1:01:19 AM,12/30/1899 1:17:00 AM,12/30/1899 1:33:01 AM,12/30/1899 1:48:58 AM,12/30/1899 2:04:35 AM,12/30/1899 12:15:25 AM,32,21.0,...,4,4,12/30/1899 1:04:45 AM,M,"Biwott, Shadrack",1,12/30/1899 2:12:08 AM,4,12/30/1899 12:05:03 AM,CA
4,12/30/1899 12:30:28 AM,12/30/1899 12:45:44 AM,12/30/1899 1:01:15 AM,12/30/1899 1:16:59 AM,12/30/1899 1:33:01 AM,12/30/1899 1:48:41 AM,12/30/1899 2:05:00 AM,12/30/1899 12:15:25 AM,31,9.0,...,5,5,12/30/1899 1:04:35 AM,M,"Chebet, Wilson",1,12/30/1899 2:12:35 AM,5,12/30/1899 12:05:04 AM,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26405,12/30/1899 1:35:41 AM,12/30/1899 2:23:35 AM,12/30/1899 3:12:44 AM,12/30/1899 4:12:06 AM,12/30/1899 5:03:08 AM,12/30/1899 5:55:18 AM,12/30/1899 6:46:57 AM,12/30/1899 12:46:44 AM,61,25166.0,...,344,11972,12/30/1899 3:23:31 AM,F,"Steinbach, Paula Eyvonne",1,12/30/1899 7:09:39 AM,26407,12/30/1899 12:16:24 AM,CA
26406,12/30/1899 1:05:33 AM,12/30/1899 1:52:17 AM,12/30/1899 2:49:41 AM,12/30/1899 3:50:19 AM,12/30/1899 4:50:01 AM,12/30/1899 5:53:48 AM,12/30/1899 6:54:21 AM,12/30/1899 12:32:03 AM,25,25178.0,...,4774,14436,12/30/1899 3:00:26 AM,M,"Avelino, Andrew R.",1,12/30/1899 7:16:59 AM,26408,12/30/1899 12:16:40 AM,NC
26407,12/30/1899 1:43:36 AM,12/30/1899 2:32:36 AM,,12/30/1899 4:15:21 AM,12/30/1899 5:06:37 AM,12/30/1899 6:00:33 AM,12/30/1899 6:54:38 AM,12/30/1899 12:53:11 AM,57,27086.0,...,698,11973,12/30/1899 3:36:24 AM,F,"Hantel, Johanna",1,12/30/1899 7:19:37 AM,26409,12/30/1899 12:16:47 AM,PA
26408,12/30/1899 1:27:19 AM,12/30/1899 2:17:17 AM,12/30/1899 3:11:40 AM,12/30/1899 4:06:10 AM,12/30/1899 5:07:09 AM,12/30/1899 6:06:07 AM,12/30/1899 6:56:08 AM,12/30/1899 12:40:34 AM,64,25268.0,...,1043,14437,12/30/1899 3:22:30 AM,M,"Reilly, Bill",1,12/30/1899 7:20:44 AM,26410,12/30/1899 12:16:49 AM,NY
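
If you don't have the marathon file on your machine, the same call can be tried on text held in memory. This is only a sketch: the CSV text below is made up, and `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = "Name,Age,State\nAda,31,OR\nGrace,45,\n"

runners = pd.read_csv(io.StringIO(csv_text))
print(runners)
print(runners.dtypes)

# The empty State field on the second row is read as a missing value (NaN)
print(runners['State'].isnull().sum())  # 1
```

This mirrors what happens in the marathon data above, where blank State entries show up as missing values.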


### In-Class Exercise #1 - Read in Boston Red Sox Hitting Data <br>
<p>Use the pandas read_csv() method to read in the statistics from yesterday's two files.</p>

In [5]:
bs2017 = pd.read_csv(r'C:\Users\jmira\OneDrive\Coding Temple Cohort 0501 0623\week_3\redsox_2017_hitting.txt')
bs2018 = pd.read_csv(r'C:\Users\jmira\OneDrive\Coding Temple Cohort 0501 0623\week_3\redsox_2018_hitting.txt')

display(bs2017)
display(bs2018)

Unnamed: 0,Rk,Pos,Name,Age,G,PA,AB,R,H,2B,...,OBP,SLG,OPS,OPS+,TB,GDP,HBP,SH,SF,IBB
0,1,C,Christian Vazquez,26,99,345,324,43,94,18,...,0.33,0.404,0.735,91,131,14,3,0,1,0
1,2,1B,Mitch Moreland,31,149,576,508,73,125,34,...,0.326,0.443,0.769,99,225,14,6,0,5,6
2,3,2B,Dustin Pedroia,33,105,463,406,46,119,19,...,0.369,0.392,0.76,100,159,11,2,2,4,4
3,4,SS,Xander Bogaerts,24,148,635,571,94,156,32,...,0.343,0.403,0.746,95,230,17,6,0,2,6
4,5,3B,Rafael Devers,20,58,240,222,34,63,14,...,0.338,0.482,0.819,111,107,5,0,0,0,3
5,6,LF,Andrew Benintendi,22,151,658,573,84,155,26,...,0.352,0.424,0.776,102,243,16,6,1,8,7
6,7,CF,Jackie Bradley Jr.,27,133,541,482,58,118,19,...,0.323,0.402,0.726,89,194,8,9,0,2,4
7,8,RF,Mookie Betts,24,153,712,628,101,166,46,...,0.344,0.459,0.803,108,288,9,2,0,5,9
8,9,DH,Hanley Ramirez,33,133,553,496,58,120,24,...,0.32,0.429,0.75,94,213,15,6,0,0,8
9,10,C,Sandy Leon,28,85,301,271,32,61,14,...,0.29,0.354,0.644,68,96,5,1,1,3,1


Unnamed: 0,Rk,Pos,Name,Age,G,PA,AB,R,H,2B,...,OBP,SLG,OPS,OPS+,TB,GDP,HBP,SH,SF,IBB
0,1,C,Sandy Leon,29,89,288,265,30,47,12,...,0.232,0.279,0.511,37,74,6,4,3,1,0
1,2,1B,Mitch Moreland,32,124,459,404,57,99,23,...,0.325,0.433,0.758,102,175,12,0,0,5,2
2,3,2B,Eduardo Nunez,31,127,502,480,56,127,23,...,0.289,0.388,0.677,81,186,17,2,1,3,0
3,4,SS,Xander Bogaerts,25,136,580,513,72,148,45,...,0.36,0.522,0.883,135,268,14,6,0,6,4
4,5,3B,Rafael Devers,21,121,490,450,59,108,24,...,0.298,0.433,0.731,94,195,9,0,0,2,6
5,6,LF,Andrew Benintendi,23,148,661,579,103,168,41,...,0.366,0.465,0.83,123,269,9,2,2,7,1
6,7,CF,Jackie Bradley Jr.,28,144,535,474,76,111,33,...,0.314,0.403,0.717,92,191,6,11,0,4,3
7,8,RF,Mookie Betts,25,136,614,520,129,180,47,...,0.438,0.64,1.078,186,333,5,8,0,5,8
8,9,DH,J.D. Martinez,30,150,649,569,111,188,37,...,0.402,0.629,1.031,173,358,19,4,0,7,11
9,10,MI,Brock Holt,30,109,367,321,41,89,18,...,0.362,0.411,0.774,109,132,7,7,0,2,2


### Accessing Data <br>

##### Indexing

<p>You can directly select a column of a dataframe just like you would a dict. The result is a Pandas 'Series' object.</p>

In [32]:
# Same data, different containers: df['names'] is a Pandas Series, my_dict['names'] is a plain Python list
# Both are 1-D data structures, or vectors
print(df['names'])
print(my_dict['names'])

# Let's take a look at the type of data structure we have
print(type(df['names']))
print(type(my_dict['names']))

# The difference between a vector and a matrix, using shape
print(df.shape)
print(df.names.shape)

# Index a series based on the numeric value of the index
print(df.names[0])
print(df.names[5])

# We can also chain lookups on the dataframe object itself
# We select the column first, then ask for the row by its index label.
df['ages'][0]

# What if I wanted to return multiple columns?
df[['ages', 'names']]

# Pandas makes it really easy to change the TYPE of the data with no problems at all
df['ages_float'] = df['ages'].astype(float)
df['avg_age_random_calculation'] = df['ages'] / df['ages_float']
df

0      Alice
1        Bob
2      James
3       Beth
4       John
5      Sally
6    Richard
7     Lauren
8    Brandon
9    Sabrina
Name: names, dtype: object
['Alice', 'Bob', 'James', 'Beth', 'John', 'Sally', 'Richard', 'Lauren', 'Brandon', 'Sabrina']
<class 'pandas.core.series.Series'>
<class 'list'>
(10, 3)
(10,)
Alice
Sally


Unnamed: 0,names,ages,ages_float,avg_age_random_calculation
0,Alice,18,18.0,1.0
1,Bob,29,29.0,1.0
2,James,29,29.0,1.0
3,Beth,34,34.0,1.0
4,John,27,27.0,1.0
5,Sally,33,33.0,1.0
6,Richard,32,32.0,1.0
7,Lauren,32,32.0,1.0
8,Brandon,29,29.0,1.0
9,Sabrina,20,20.0,1.0
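
A detail worth seeing in isolation is the difference between single and double brackets. The sketch below uses a small made-up dataframe (`people` is a hypothetical name) rather than the `df` from the cell above:

```python
import pandas as pd

people = pd.DataFrame({'names': ['Alice', 'Bob'], 'ages': [18, 29]})

one_col = people['ages']        # single brackets -> Series
sub_table = people[['ages']]    # list of column names -> DataFrame

print(type(one_col).__name__, one_col.shape)      # Series (2,)
print(type(sub_table).__name__, sub_table.shape)  # DataFrame (2, 1)

# astype returns a new Series; the original column keeps its integer dtype
as_float = people['ages'].astype(float)
print(as_float.dtype)  # float64
```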


##### df.loc

Along the horizontal dimension, each row of a Pandas DataFrame can be pulled out as a Series object. You will notice there is an extra, unlabeled column on the left of the DataFrame - this is the $\textit{index}$. It is automatically generated as a row number, but can be reassigned to a column of your choice using the DataFrame.set_index(colname) method. We can use it with df.loc to access particular Pandas $\textit{rows}$, which are also Series objects:

In [43]:
# df.set_index('names')
# Grab the first row of data using the index for that row
print(df.loc[0])
print(type(df.loc[0]))

# Grab multiple rows by passing a list of index labels to the .loc indexer
# Chaining a second set of brackets then selects columns from that result
print(df.loc[[0,1,2]][['names', 'ages_float']])
print(df.loc[0:2][:])

# Select rows with df.loc, then set a user-defined index on the result
# Here the rows are chosen with a list of labels
new_df = df.loc[[0,1,2]][:].set_index('names')
display(new_df)

# Slicing is often more concise. Note that .loc slices include the end label, so 0:3 returns four rows
display(df.loc[0:3].set_index('names'))

names                         Alice
ages                             18
ages_float                     18.0
avg_age_random_calculation      1.0
Name: 0, dtype: object
<class 'pandas.core.series.Series'>
   names  ages_float
0  Alice        18.0
1    Bob        29.0
2  James        29.0
   names  ages  ages_float  avg_age_random_calculation
0  Alice    18        18.0                         1.0
1    Bob    29        29.0                         1.0
2  James    29        29.0                         1.0


names,ages,ages_float,avg_age_random_calculation
Alice,18,18.0,1.0
Bob,29,29.0,1.0
James,29,29.0,1.0


names,ages,ages_float,avg_age_random_calculation
Alice,18,18.0,1.0
Bob,29,29.0,1.0
James,29,29.0,1.0
Beth,34,34.0,1.0
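
One thing that trips people up is that `.loc` slices by label and includes the end label. The sketch below contrasts it with `.iloc`, the position-based indexer (not used elsewhere in this lesson), on a small made-up dataframe:

```python
import pandas as pd

people = pd.DataFrame(
    {'names': ['Alice', 'Bob', 'James', 'Beth'], 'ages': [18, 29, 29, 34]}
)

# .loc slices by label and includes the end label: rows 0, 1 and 2
print(len(people.loc[0:2]))   # 3
# .iloc slices by position and excludes the end position: rows 0 and 1
print(len(people.iloc[0:2]))  # 2

# Rows and columns can be selected in a single .loc call
print(people.loc[[0, 1], ['names', 'ages']])

# After set_index, .loc looks rows up by the new labels
by_name = people.set_index('names')
print(by_name.loc['Beth', 'ages'])  # 34
```

Selecting rows and columns in one `.loc[rows, cols]` call is generally a safer habit than chaining two sets of brackets, especially once you start assigning values.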


In [45]:
# Conditional statement:
new_df[new_df.ages == 29]

names,ages,ages_float,avg_age_random_calculation
Bob,29,29.0,1.0
James,29,29.0,1.0


##### keys()

In [50]:
my_dict.keys()

# Keys also works to gather the columns of a dataframe object:
print(df.keys())
print(bs2017.keys())
# We can also access this information using the .columns attribute
print(df.columns)
print(bs2018.columns)

# We can also cast these objects to a list!
bs2018.columns.to_list()

Index(['names', 'ages', 'ages_float', 'avg_age_random_calculation'], dtype='object')
Index(['Rk', 'Pos', 'Name', 'Age', 'G', 'PA', 'AB', 'R', 'H', '2B', '3B', 'HR',
       'RBI', 'SB', 'CS', 'BB', 'SO', 'BA', 'OBP', 'SLG', 'OPS', 'OPS+', 'TB',
       'GDP', 'HBP', 'SH', 'SF', 'IBB'],
      dtype='object')
Index(['names', 'ages', 'ages_float', 'avg_age_random_calculation'], dtype='object')
Index(['Rk', 'Pos', 'Name', 'Age', 'G', 'PA', 'AB', 'R', 'H', '2B', '3B', 'HR',
       'RBI', 'SB', 'CS', 'BB', 'SO', 'BA', 'OBP', 'SLG', 'OPS', 'OPS+', 'TB',
       'GDP', 'HBP', 'SH', 'SF', 'IBB'],
      dtype='object')


['Rk',
 'Pos',
 'Name',
 'Age',
 'G',
 'PA',
 'AB',
 'R',
 'H',
 '2B',
 '3B',
 'HR',
 'RBI',
 'SB',
 'CS',
 'BB',
 'SO',
 'BA',
 'OBP',
 'SLG',
 'OPS',
 'OPS+',
 'TB',
 'GDP',
 'HBP',
 'SH',
 'SF',
 'IBB']

##### Slicing a DataFrame

In [55]:
print(f'Full Dataframe:\n{df}')
print(f'\n Full Dataframe as a sliced copy: \n {df[:]}')
print(f'\nFirst row of data: \n{df[:1]}')
print(f'\nSixth row of data to the end of the dataset: \n{df[5:]}')
print(f'\n3rd through the 6th row of data: \n{df[2:6]}')

Full Dataframe:
     names  ages  ages_float  avg_age_random_calculation
0    Alice    18        18.0                         1.0
1      Bob    29        29.0                         1.0
2    James    29        29.0                         1.0
3     Beth    34        34.0                         1.0
4     John    27        27.0                         1.0
5    Sally    33        33.0                         1.0
6  Richard    32        32.0                         1.0
7   Lauren    32        32.0                         1.0
8  Brandon    29        29.0                         1.0
9  Sabrina    20        20.0                         1.0

 Full Dataframe as a sliced copy: 
      names  ages  ages_float  avg_age_random_calculation
0    Alice    18        18.0                         1.0
1      Bob    29        29.0                         1.0
2    James    29        29.0                         1.0
3     Beth    34        34.0                         1.0
4     John    27        27.0       

### Built-In Methods <br>

<p>These are methods that are frequently used when working with Pandas to make your life easier. It is possible to spend a whole week simply exploring the built-in functions supported by DataFrames in Pandas. Here, however, we will simply highlight a few that might be useful, to give you an idea of what's possible out of the box with Pandas:</p>

##### .head()

In [57]:
# df.head() -- Allows us to view the beginning of a dataframe. The .head() method can also accept an integer value
# to return a different number of rows than the default 5
display(bs2017.head())

# Edit the number of rows returned
display(bs2017.head(20))

Unnamed: 0,Rk,Pos,Name,Age,G,PA,AB,R,H,2B,...,OBP,SLG,OPS,OPS+,TB,GDP,HBP,SH,SF,IBB
0,1,C,Christian Vazquez,26,99,345,324,43,94,18,...,0.33,0.404,0.735,91,131,14,3,0,1,0
1,2,1B,Mitch Moreland,31,149,576,508,73,125,34,...,0.326,0.443,0.769,99,225,14,6,0,5,6
2,3,2B,Dustin Pedroia,33,105,463,406,46,119,19,...,0.369,0.392,0.76,100,159,11,2,2,4,4
3,4,SS,Xander Bogaerts,24,148,635,571,94,156,32,...,0.343,0.403,0.746,95,230,17,6,0,2,6
4,5,3B,Rafael Devers,20,58,240,222,34,63,14,...,0.338,0.482,0.819,111,107,5,0,0,0,3


Unnamed: 0,Rk,Pos,Name,Age,G,PA,AB,R,H,2B,...,OBP,SLG,OPS,OPS+,TB,GDP,HBP,SH,SF,IBB
0,1,C,Christian Vazquez,26,99,345,324,43,94,18,...,0.33,0.404,0.735,91,131,14,3,0,1,0
1,2,1B,Mitch Moreland,31,149,576,508,73,125,34,...,0.326,0.443,0.769,99,225,14,6,0,5,6
2,3,2B,Dustin Pedroia,33,105,463,406,46,119,19,...,0.369,0.392,0.76,100,159,11,2,2,4,4
3,4,SS,Xander Bogaerts,24,148,635,571,94,156,32,...,0.343,0.403,0.746,95,230,17,6,0,2,6
4,5,3B,Rafael Devers,20,58,240,222,34,63,14,...,0.338,0.482,0.819,111,107,5,0,0,0,3
5,6,LF,Andrew Benintendi,22,151,658,573,84,155,26,...,0.352,0.424,0.776,102,243,16,6,1,8,7
6,7,CF,Jackie Bradley Jr.,27,133,541,482,58,118,19,...,0.323,0.402,0.726,89,194,8,9,0,2,4
7,8,RF,Mookie Betts,24,153,712,628,101,166,46,...,0.344,0.459,0.803,108,288,9,2,0,5,9
8,9,DH,Hanley Ramirez,33,133,553,496,58,120,24,...,0.32,0.429,0.75,94,213,15,6,0,0,8
9,10,C,Sandy Leon,28,85,301,271,32,61,14,...,0.29,0.354,0.644,68,96,5,1,1,3,1


##### .tail()

In [58]:
# Same idea as .head(), but .tail() returns rows from the end of the dataframe
# It takes the same optional integer argument as our .head() method.
display(bs2017.tail())

# Edit number of rows returned to us:
display(bs2017.tail(20))

Unnamed: 0,Rk,Pos,Name,Age,G,PA,AB,R,H,2B,...,OBP,SLG,OPS,OPS+,TB,GDP,HBP,SH,SF,IBB
18,19,IF,Marco Hernandez,24,21,60,58,7,16,3,...,0.3,0.328,0.628,65,19,0,1,0,0,0
19,20,UT,Rajai Davis,36,17,38,36,7,9,2,...,0.289,0.306,0.595,56,11,2,1,0,0,0
20,21,UT,Steve Selsky,27,8,9,9,0,1,1,...,0.111,0.222,0.333,-16,2,0,0,0,0,0
21,22,UT,Blake Swihart,25,6,7,5,1,1,0,...,0.429,0.2,0.629,74,1,0,0,0,0,0
22,23,2B,Chase d'Arnaud,30,2,1,1,2,1,0,...,1.0,1.0,2.0,428,1,0,0,0,0,0


Unnamed: 0,Rk,Pos,Name,Age,G,PA,AB,R,H,2B,...,OBP,SLG,OPS,OPS+,TB,GDP,HBP,SH,SF,IBB
3,4,SS,Xander Bogaerts,24,148,635,571,94,156,32,...,0.343,0.403,0.746,95,230,17,6,0,2,6
4,5,3B,Rafael Devers,20,58,240,222,34,63,14,...,0.338,0.482,0.819,111,107,5,0,0,0,3
5,6,LF,Andrew Benintendi,22,151,658,573,84,155,26,...,0.352,0.424,0.776,102,243,16,6,1,8,7
6,7,CF,Jackie Bradley Jr.,27,133,541,482,58,118,19,...,0.323,0.402,0.726,89,194,8,9,0,2,4
7,8,RF,Mookie Betts,24,153,712,628,101,166,46,...,0.344,0.459,0.803,108,288,9,2,0,5,9
8,9,DH,Hanley Ramirez,33,133,553,496,58,120,24,...,0.32,0.429,0.75,94,213,15,6,0,0,8
9,10,C,Sandy Leon,28,85,301,271,32,61,14,...,0.29,0.354,0.644,68,96,5,1,1,3,1
10,11,UT,Chris Young,33,90,276,243,30,57,12,...,0.322,0.387,0.709,85,94,4,2,0,1,0
11,12,3B,Deven Marrero,26,71,188,171,32,36,9,...,0.259,0.333,0.593,54,57,8,0,3,2,0
12,13,2B,Eduardo Nunez,30,38,173,165,23,53,12,...,0.353,0.539,0.892,128,89,3,2,0,0,0


##### .describe()
Probably one of the most important methods to understand. .describe() collects summary statistics in one dataframe object, allowing easy viewing and understanding of the count of values, mean, standard deviation, minimum value, maximum value, and quartiles (the 25%, 50%, and 75% rows).

In [61]:
# Describe by default, applies summary statistics to the numerical columns present within a dataframe object.
display(df.describe())

# Describe can also be run on a single column, or a Series object
df.ages.describe()

Unnamed: 0,ages,ages_float,avg_age_random_calculation
count,10.0,10.0,10.0
mean,28.3,28.3,1.0
std,5.375872,5.375872,0.0
min,18.0,18.0,1.0
25%,27.5,27.5,1.0
50%,29.0,29.0,1.0
75%,32.0,32.0,1.0
max,34.0,34.0,1.0


count    10.000000
mean     28.300000
std       5.375872
min      18.000000
25%      27.500000
50%      29.000000
75%      32.000000
max      34.000000
Name: ages, dtype: float64

In [62]:
# Describe to look at the summary statistics of the object columns
# To do so, we can use the exclude parameter within the .describe() method
bs2017.describe(exclude='number')

Unnamed: 0,Pos,Name
count,23,23
unique,11,23
top,UT,Christian Vazquez
freq,6,1
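
Because `.describe()` returns a regular Pandas object, you can pull individual statistics out of it by label. Here is a short sketch that rebuilds the ages column from above as a standalone Series:

```python
import pandas as pd

ages = pd.Series([18, 29, 29, 34, 27, 33, 32, 32, 29, 20], name='ages')
summary = ages.describe()

# describe() on a Series returns a Series, so statistics can be read by label
print(summary['mean'])   # 28.3
print(summary['50%'])    # 29.0 (the median)

# The interquartile range is the distance between the 75% and 25% rows
print(summary['75%'] - summary['25%'])  # 4.5
```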


##### .sort_values()

In [65]:
# sort_values is used to sort the values based on a label
display(df)
display(df.sort_values('names'))

# What if I want to sort my values and then reset the index?
display(df.sort_values('names', ascending=False).reset_index(drop=True))

Unnamed: 0,names,ages,ages_float,avg_age_random_calculation
0,Alice,18,18.0,1.0
1,Bob,29,29.0,1.0
2,James,29,29.0,1.0
3,Beth,34,34.0,1.0
4,John,27,27.0,1.0
5,Sally,33,33.0,1.0
6,Richard,32,32.0,1.0
7,Lauren,32,32.0,1.0
8,Brandon,29,29.0,1.0
9,Sabrina,20,20.0,1.0


Unnamed: 0,names,ages,ages_float,avg_age_random_calculation
0,Alice,18,18.0,1.0
3,Beth,34,34.0,1.0
1,Bob,29,29.0,1.0
8,Brandon,29,29.0,1.0
2,James,29,29.0,1.0
4,John,27,27.0,1.0
7,Lauren,32,32.0,1.0
6,Richard,32,32.0,1.0
9,Sabrina,20,20.0,1.0
5,Sally,33,33.0,1.0


Unnamed: 0,names,ages,ages_float,avg_age_random_calculation
0,Sally,33,33.0,1.0
1,Sabrina,20,20.0,1.0
2,Richard,32,32.0,1.0
3,Lauren,32,32.0,1.0
4,John,27,27.0,1.0
5,James,29,29.0,1.0
6,Brandon,29,29.0,1.0
7,Bob,29,29.0,1.0
8,Beth,34,34.0,1.0
9,Alice,18,18.0,1.0
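
`sort_values` also accepts a list of columns, with a matching list for `ascending`, which is handy for breaking ties. A minimal sketch on made-up data:

```python
import pandas as pd

people = pd.DataFrame({'names': ['Bob', 'Alice', 'James'], 'ages': [29, 18, 29]})

# Sort by age (oldest first), then break ties alphabetically by name
ordered = (
    people.sort_values(['ages', 'names'], ascending=[False, True])
          .reset_index(drop=True)
)
print(ordered)
print(ordered['names'].tolist())  # ['Bob', 'James', 'Alice']
```

Like most dataframe methods, `sort_values` returns a new dataframe here, so `people` keeps its original order.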


##### .isnull()

This method returns a boolean mask with the same shape as the DataFrame object, holding True for any NaN (null) values and False for all others.

In [71]:
# Method to return the boolean mask across the entire dataframe object
bs2017.isnull()

# How can we check whether there is a null value in a specific column?
bs2017.isnull().sum()
bs2017['Rk'][bs2017['Rk'].isnull()]

Series([], Name: Rk, dtype: int64)
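
The Red Sox data has no missing values in that column, so the result is empty. To see the mask actually catch something, here is a sketch on a tiny made-up dataframe with one missing value:

```python
import numpy as np
import pandas as pd

stats = pd.DataFrame({'Name': ['A', 'B', 'C'], 'HR': [10, np.nan, 3]})

# Count missing values per column
print(stats.isnull().sum())

# Use the mask to pull out the rows where HR is missing
missing_hr = stats[stats['HR'].isnull()]
print(missing_hr)

# Comparing against the string 'NaN' never matches a real missing value
print((stats['HR'] == 'NaN').sum())  # 0
```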

##### .nunique()

In [74]:
# Provides us with the number of unique values in each column of a dataframe object
print(bs2017.nunique())

# We can also call this function on a specific column or Series object as well
# The return from this is a singular integer value
print('\n',bs2017['Name'].nunique())

Rk      23
Pos     11
Name    23
Age     13
G       22
PA      23
AB      23
R       18
H       21
2B      15
3B       5
HR      11
RBI     16
SB      12
CS       5
BB      19
SO      22
BA      22
OBP     22
SLG     22
OPS     22
OPS+    23
TB      21
GDP     13
HBP      6
SH       4
SF       7
IBB      8
dtype: int64

 23


##### .info()

In [75]:
# Provides useful information on each column present within a dataframe object
# This includes the number of non-null values, the data types, and the count of rows/columns
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   names                       10 non-null     object 
 1   ages                        10 non-null     int32  
 2   ages_float                  10 non-null     float64
 3   avg_age_random_calculation  10 non-null     float64
dtypes: float64(2), int32(1), object(1)
memory usage: 412.0+ bytes


##### .shape

In [78]:
# This is an attribute of the dataframe object
# A dataframe is nothing more than a class that is built in the Pandas library to handle data
# We call attributes of a class using dot notation.
print(df.shape)

# The series class has the same attribute and can be accessed in the exact same way!
df['ages_float'].shape


(10, 4)


(10,)

##### .columns

In [81]:
# .columns is another attribute we can call on a dataframe object
# This will show all the columns present within a dataframe
print(df.columns)

# What kind of datatype is this object?
print(type(df.columns))

# Another cool thing about the columns is that each one is also exposed as an attribute of the dataframe
# (as long as its name is a valid Python identifier), so we can reach it using dot notation, as we have already seen.
print(df.ages)


Index(['names', 'ages', 'ages_float', 'avg_age_random_calculation'], dtype='object')
<class 'pandas.core.indexes.base.Index'>
0    18
1    29
2    29
3    34
4    27
5    33
6    32
7    32
8    29
9    20
Name: ages, dtype: int32


### In-Class Exercise #2 - Describe & Sort Boston Red Sox Hitting Data <br>
<p>Take the data that you read in earlier from the Red Sox CSVs and use the describe method to understand the data better. Compare the two years and decide which year's team had the better season. Then sort the values based on Batting Average.</p>

In [85]:
print(f'The difference between the years 2017 and 2018 total bases is: {sum(bs2017["TB"]) - sum(bs2018["TB"])}')
print(f"2018 was a better year for total bases for the Boston Red Sox by: {sum(bs2018['TB']) - sum(bs2017['TB'])}")

# sort the values based on a column:
display(bs2018.sort_values('BA'))

The difference between the years 2017 and 2018 total bases is: -242
2018 was a better year for total bases for the Boston Red Sox by: 242


Unnamed: 0,Rk,Pos,Name,Age,G,PA,AB,R,H,2B,...,OBP,SLG,OPS,OPS+,TB,GDP,HBP,SH,SF,IBB
18,19,2B,Dustin Pedroia,34,3,13,11,1,1,0,...,0.231,0.091,0.322,-7,1,0,0,0,0,0
17,18,UT,Brandon Phillips,37,9,27,23,4,3,0,...,0.259,0.261,0.52,42,6,1,0,0,0,0
19,20,C,Dan Butler,31,2,7,6,0,1,0,...,0.143,0.167,0.31,-17,1,0,0,0,1,0
0,1,C,Sandy Leon,29,89,288,265,30,47,12,...,0.232,0.279,0.511,37,74,6,4,3,1,0
10,11,C,Christian Vazquez,27,80,269,251,24,52,10,...,0.257,0.283,0.54,46,71,5,4,1,0,1
16,17,UT,Sam Travis,24,19,38,36,5,8,3,...,0.263,0.389,0.652,73,14,1,0,0,0,0
11,12,UT,Blake Swihart,26,82,207,192,28,44,10,...,0.285,0.328,0.613,65,63,4,0,0,0,0
6,7,CF,Jackie Bradley Jr.,28,144,535,474,76,111,33,...,0.314,0.403,0.717,92,191,6,11,0,4,3
4,5,3B,Rafael Devers,21,121,490,450,59,108,24,...,0.298,0.433,0.731,94,195,9,0,0,2,6
14,15,2B,Ian Kinsler,36,37,143,132,17,32,6,...,0.294,0.311,0.604,64,41,5,0,0,1,0


### Filtration <br>
<p>Let's look at how to filter dataframes for rows that fulfill a specific condition.</p>

##### Conditionals

In [90]:
# Boolean mask (conditional statement) returns True or False
conditional_mask = df['ages'] >= 25
conditional_mask

0    False
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9    False
Name: ages, dtype: bool

##### Subsetting

In [92]:
display(df[conditional_mask])

df['names'][conditional_mask]

Unnamed: 0,names,ages,ages_float,avg_age_random_calculation
1,Bob,29,29.0,1.0
2,James,29,29.0,1.0
3,Beth,34,34.0,1.0
4,John,27,27.0,1.0
5,Sally,33,33.0,1.0
6,Richard,32,32.0,1.0
7,Lauren,32,32.0,1.0
8,Brandon,29,29.0,1.0


1        Bob
2      James
3       Beth
4       John
5      Sally
6    Richard
7     Lauren
8    Brandon
Name: names, dtype: object
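
Masks can also be combined. The sketch below (on a small made-up dataframe) uses `&` for "and" and `~` for "not". Note that each condition needs its own parentheses:

```python
import pandas as pd

people = pd.DataFrame(
    {'names': ['Alice', 'Bob', 'Beth', 'Sabrina'], 'ages': [18, 29, 34, 20]}
)

# Combine masks with & (and) or | (or); wrap each condition in parentheses
in_range = (people['ages'] >= 20) & (people['ages'] < 30)
print(people[in_range])

# ~ inverts a mask, keeping the rows that did NOT match
print(people.loc[~in_range, 'names'].tolist())  # ['Alice', 'Beth']
```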

### Column Transformations <br>
<p>Rarely, if ever, will the columns in the original raw dataframe read from CSV or database table be the ones you actually need for your analysis. You will spend lots of time constantly transforming columns or groups of columns using general computational operations to produce new ones that are functions of the old ones. Pandas has full support for this: Consider the following dataframe containing membership term and renewal number for a group of customers:</p>

In [13]:
# Generate fake data
np.random.seed(42) # Random seed to keep data the same
customer_id = np.random.randint(1000,1100,10)
renewal_hbr = np.random.randint(0,10,10)
customer_dict = {1: 0.5, 0: 1}

# Example of feature engineering. We are going to create a new feature within the data from another source of data
term_in_years = [customer_dict[key] for key in np.random.randint(0,2,10)]

# Combine them all into a single dictionary object:
random_data = {
    'customer_id' : customer_id.astype(str),
    'Renewal HBR' : renewal_hbr,
    'term_in_years' : term_in_years
}

# Now, we can create a dataframe from this!
customers = pd.DataFrame.from_dict(random_data)
customers

Unnamed: 0,customer_id,Renewal HBR,term_in_years
0,1051,7,0.5
1,1092,4,0.5
2,1014,3,0.5
3,1071,7,0.5
4,1060,7,0.5
5,1020,2,0.5
6,1082,5,1.0
7,1086,4,1.0
8,1074,1,0.5
9,1074,7,0.5


In [14]:
customers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   customer_id    10 non-null     object 
 1   Renewal HBR    10 non-null     int32  
 2   term_in_years  10 non-null     float64
dtypes: float64(1), int32(1), object(1)
memory usage: 328.0+ bytes


In [15]:
# An example of a column transformation.
# This is important in the world of data analytics
customers['customer_id'] = customers['customer_id'].astype(int)

##### Feature Engineering a New Column w/Data

In [18]:
# dataframe['column_name'] = Some calculation done on another column or list of equal length
customers['customer_tenure'] = customers['Renewal HBR'] * customers['term_in_years']

# What if I wanted to augment the ID column to be the id + 1?
# Pandas will iterate over the column FOR YOU
customers['aug_id'] = customers['customer_id'] + 1
display(customers)

Unnamed: 0,customer_id,Renewal HBR,term_in_years,customer_tenure,aug_id
0,1051,7,0.5,3.5,1052
1,1092,4,0.5,2.0,1093
2,1014,3,0.5,1.5,1015
3,1071,7,0.5,3.5,1072
4,1060,7,0.5,3.5,1061
5,1020,2,0.5,1.0,1021
6,1082,5,1.0,5.0,1083
7,1086,4,1.0,4.0,1087
8,1074,1,0.5,0.5,1075
9,1074,7,0.5,3.5,1075
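
Column arithmetic covers a lot of ground, but sometimes the logic doesn't fit a simple formula. One common approach is to write a small function and hand it to `.apply()`. The sketch below is illustrative only: the dataframe, the `tenure_label` function, and its 3-year cutoff are all made up.

```python
import pandas as pd

customers_demo = pd.DataFrame(
    {'renewals': [7, 4, 5], 'term_in_years': [0.5, 0.5, 1.0]}
)

# Vectorized arithmetic works row by row without an explicit loop
customers_demo['tenure'] = customers_demo['renewals'] * customers_demo['term_in_years']

# Hypothetical user-defined function, applied to every value of a column
def tenure_label(years):
    return 'long' if years >= 3 else 'short'

customers_demo['tenure_label'] = customers_demo['tenure'].apply(tenure_label)
print(customers_demo)
```

When a calculation can be written as plain column arithmetic, that form is usually faster than `.apply()`.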


#### Dropping a column

In [1]:
# Creating a duplicate column in my dataset:
customers['customer_id_1'] = customers['customer_id']
display(customers)

# Let's go ahead and try to drop this column!
# This call fails: without an axis argument, drop() looks for the label among the rows
customers.drop('customer_id_1')

KeyError: "['customer_id_1'] not found in axis"

##### Axis? What's That all about?
##### This: 
![atext](https://i.stack.imgur.com/dcoE3.jpg)  
If you check the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html), you'll see that DataFrame.drop() has "axis = 0" as a default, which means it searches the row labels. We need to explicitly (remember the Zen of Python?) tell pandas to look for the label we want to drop along the column axis, which is axis 1.
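
Here is a small sketch of that behavior on a throwaway dataframe (`demo` is a made-up name). It also shows the `columns=` keyword, which is an equivalent way to say "drop along axis 1":

```python
import pandas as pd

demo = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Default axis=0 searches the row labels, so a column name is not found
try:
    demo.drop('b')
except KeyError as err:
    print('KeyError:', err)

# Either form drops the column and returns a new DataFrame
dropped_axis = demo.drop('b', axis=1)
dropped_kw = demo.drop(columns='b')
print(dropped_axis.columns.tolist())  # ['a']

# The original is unchanged unless the result is assigned back
print(demo.columns.tolist())  # ['a', 'b']
```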

In [103]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


In [107]:
# Now that we are armed with this info, let's try again!
customers.drop('customer_id_1', axis = 1)

# Let's double check our dataframe!
customers

# The column is still there: drop() returns a new dataframe and leaves the original untouched.
# How can I fix it?
# Preferred method: save the dataframe to another dataframe variable so that you don't overwrite your source of truth.
df2 = customers.drop('customer_id_1', axis = 1)
display(df2)

# The other method is to drop from the original and overwrite the dataframe in-place.
customers.drop('customer_id_1', axis = 1, inplace = True)
display(customers)

Unnamed: 0,customer_id,renewal_hbr,term_in_years,customer_tenure,aug_id
0,1051,7,0.5,3.5,1052
1,1092,4,0.5,2.0,1093
2,1014,3,0.5,1.5,1015
3,1071,7,0.5,3.5,1072
4,1060,7,0.5,3.5,1061
5,1020,2,0.5,1.0,1021
6,1082,5,1.0,5.0,1083
7,1086,4,1.0,4.0,1087
8,1074,1,0.5,0.5,1075
9,1074,7,0.5,3.5,1075


Unnamed: 0,customer_id,renewal_hbr,term_in_years,customer_tenure,aug_id
0,1051,7,0.5,3.5,1052
1,1092,4,0.5,2.0,1093
2,1014,3,0.5,1.5,1015
3,1071,7,0.5,3.5,1072
4,1060,7,0.5,3.5,1061
5,1020,2,0.5,1.0,1021
6,1082,5,1.0,5.0,1083
7,1086,4,1.0,4.0,1087
8,1074,1,0.5,0.5,1075
9,1074,7,0.5,3.5,1075
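
The axis behaviour and the difference between a returned copy and the original can be checked on a small frame. This is a sketch under the assumption of a two-row toy frame named `demo`; the `columns=` keyword shown at the end is an alternative spelling that pandas also accepts.

```python
import pandas as pd

demo = pd.DataFrame({'customer_id': [1051, 1092], 'customer_id_1': [1051, 1092]})

# axis=0 (the default) searches the row labels, so a column name is not found.
try:
    demo.drop('customer_id_1')
except KeyError as err:
    print('KeyError:', err)

# axis=1 searches the column labels; the result is a NEW frame.
dropped = demo.drop('customer_id_1', axis=1)
print(list(dropped.columns))  # ['customer_id']
print(list(demo.columns))     # ['customer_id', 'customer_id_1'] (original unchanged)

# The columns= keyword expresses the same thing without an axis number.
same = demo.drop(columns='customer_id_1')
```

Assigning the result to a new name keeps the original frame available if a later step goes wrong.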


#### Renaming a column

In [112]:
customers.rename(columns= {'Renewal HBR' : 'renewal_hbr'}, inplace=True)
display(customers)

Unnamed: 0,customer_id,renewal_hbr,term_in_years,customer_tenure,aug_id
0,1051,7,0.5,3.5,1052
1,1092,4,0.5,2.0,1093
2,1014,3,0.5,1.5,1015
3,1071,7,0.5,3.5,1072
4,1060,7,0.5,3.5,1061
5,1020,2,0.5,1.0,1021
6,1082,5,1.0,5.0,1083
7,1086,4,1.0,4.0,1087
8,1074,1,0.5,0.5,1075
9,1074,7,0.5,3.5,1075
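
When many columns need the same clean-up, `rename()` also accepts a function that is applied to every column label. The sketch below assumes a one-row toy frame, and `to_snake_case` is a hypothetical helper written for this example, not a pandas function.

```python
import pandas as pd

demo = pd.DataFrame({'customer_id': [1051], 'Renewal HBR': [7], 'term_in_years': [0.5]})

# Hypothetical helper: lower-case a label and replace spaces with underscores.
def to_snake_case(label: str) -> str:
    return label.strip().lower().replace(' ', '_')

# Without inplace=True, rename() returns a new frame and leaves demo unchanged.
renamed = demo.rename(columns=to_snake_case)

print(list(renamed.columns))  # ['customer_id', 'renewal_hbr', 'term_in_years']
print(list(demo.columns))     # ['customer_id', 'Renewal HBR', 'term_in_years']
```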


##### User Defined Function

If what you want to do to a column can't be represented by simple mathematical operations, you can write your own $\textit{user defined function}$ with the full customizability available in Python and any external Python packages, then apply it directly to a column. Let's add some ages to our customer dataframe, and then classify them using our own custom grouping scheme:

In [115]:
# Seed the random number generator so the generated ages are reproducible
np.random.seed(42)

# Create a new column of 10 random ages from 18 to 69 (the upper bound, 70, is exclusive)
customers['ages'] = np.random.randint(18,70,10)

# User-defined function
def make_age_groups(age:int):
    if 10 <= age < 20:
        return 'Teenager'
    elif age < 35:
        return 'Young Adult'
    elif age < 65:
        return 'Adult'
    else:
        return "Senior"

# This method uses the .apply function to create a column
customers['age_group_apply'] = customers['ages'].apply(make_age_groups)
customers

# We can also do the exact same thing using a list comprehension
customers['age_group'] = [make_age_groups(age) for age in customers['ages']]
customers

Unnamed: 0,customer_id,renewal_hbr,term_in_years,customer_tenure,aug_id,ages,age_group_apply,age_group
0,1051,7,0.5,3.5,1052,56,Adult,Adult
1,1092,4,0.5,2.0,1093,69,Senior,Senior
2,1014,3,0.5,1.5,1015,46,Adult,Adult
3,1071,7,0.5,3.5,1072,32,Young Adult,Young Adult
4,1060,7,0.5,3.5,1061,60,Adult,Adult
5,1020,2,0.5,1.0,1021,25,Young Adult,Young Adult
6,1082,5,1.0,5.0,1083,38,Adult,Adult
7,1086,4,1.0,4.0,1087,56,Adult,Adult
8,1074,1,0.5,0.5,1075,36,Adult,Adult
9,1074,7,0.5,3.5,1075,40,Adult,Adult
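
For pure range-based binning like this, pandas also offers `pd.cut`, which is a different technique from the UDF above and is shown here only as a sketch. It assumes every age is at least 10: the UDF labels younger ages 'Young Adult', whereas `pd.cut` would leave them as missing values.

```python
import numpy as np
import pandas as pd

# The ten ages shown in the table above.
ages = pd.Series([56, 69, 46, 32, 60, 25, 38, 56, 36, 40], name='ages')

def make_age_groups(age: int) -> str:
    if 10 <= age < 20:
        return 'Teenager'
    elif age < 35:
        return 'Young Adult'
    elif age < 65:
        return 'Adult'
    else:
        return 'Senior'

# right=False closes each bin on the left: [10, 20), [20, 35), [35, 65), [65, inf)
binned = pd.cut(
    ages,
    bins=[10, 20, 35, 65, np.inf],
    right=False,
    labels=['Teenager', 'Young Adult', 'Adult', 'Senior'],
)

# Compare the two approaches label by label.
print(binned.astype(str).tolist() == ages.apply(make_age_groups).tolist())  # True
```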


As a last example, here is how you would use apply() with a UDF (user-defined function) that depends on $\textit{more than one}$ column:

In [116]:
def make_loyalty_age_group(row):
    age = row['ages']
    tenure = row['customer_tenure']
    
    if 10 <= age < 20:
        age_group = 'Teenager'
    elif age < 35:
        age_group = 'Young Adult'
    elif age < 65:
        age_group =  'Adult'
    else:
        age_group = "Senior"
    
    # Use a local name that does not shadow the function itself
    if tenure > 2.0:
        loyalty_age_group = f'Loyal {age_group}'
    else:
        loyalty_age_group = f'New {age_group}'
    
    return loyalty_age_group

customers['loyalty_age_group'] = customers.apply(make_loyalty_age_group, axis=1)
customers

Unnamed: 0,customer_id,renewal_hbr,term_in_years,customer_tenure,aug_id,ages,age_group_apply,age_group,loyalty_age_group
0,1051,7,0.5,3.5,1052,56,Adult,Adult,Loyal Adult
1,1092,4,0.5,2.0,1093,69,Senior,Senior,New Senior
2,1014,3,0.5,1.5,1015,46,Adult,Adult,New Adult
3,1071,7,0.5,3.5,1072,32,Young Adult,Young Adult,Loyal Young Adult
4,1060,7,0.5,3.5,1061,60,Adult,Adult,Loyal Adult
5,1020,2,0.5,1.0,1021,25,Young Adult,Young Adult,New Young Adult
6,1082,5,1.0,5.0,1083,38,Adult,Adult,Loyal Adult
7,1086,4,1.0,4.0,1087,56,Adult,Adult,Loyal Adult
8,1074,1,0.5,0.5,1075,36,Adult,Adult,New Adult
9,1074,7,0.5,3.5,1075,40,Adult,Adult,Loyal Adult
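
Row-wise `apply(..., axis=1)` calls the Python function once per row, which can be slow on large frames. When the age group already exists as a column, the same labels can be built with column operations. This is a sketch on a three-row toy frame using `np.where`, not the approach taken in the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the first three rows of the customers table.
demo = pd.DataFrame({
    'customer_tenure': [3.5, 2.0, 1.5],
    'age_group': ['Adult', 'Senior', 'Adult'],
})

# Pick a prefix for every row at once based on the tenure threshold.
prefix = pd.Series(
    np.where(demo['customer_tenure'] > 2.0, 'Loyal', 'New'),
    index=demo.index,
)

# String columns concatenate element by element.
demo['loyalty_age_group'] = prefix + ' ' + demo['age_group']

print(demo['loyalty_age_group'].tolist())
# ['Loyal Adult', 'New Senior', 'New Adult']
```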


### In-Class Exercise #3 - Create Your Own UDF <br>
<p>Using the Boston Red Sox data, write your own UDF that creates a new column called 'All-Star'. The column should contain 'Yes' for every player with either a batting average over .280 or an on-base percentage over .360, and 'No' otherwise.</p>

In [None]:
"""
    Name  BA OBP AllStar
    --------------------
    Name .233 .360 Yes
    Name .150 .288 No
"""


### Aggregations <br>
<p>The raw data plus some transformations is generally only half the story. Your objective is to extract actual insights and actionable conclusions from the data, and that means reducing it from potentially billions of rows to some summary statistics via aggregation functions.</p>

##### groupby() <br>
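<p>The .groupby() method is in some ways a 'master' aggregation: it splits the rows into groups that share a value, and an aggregation function then reduces each group to a summary.</p>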

<p>Data tables will usually reserve one column as a primary key - that is, a column for which each row has a unique value. This is to facilitate access to the exact rows of a data table that a user wants to view. The other columns will often have repeated values, such as the age groups in the above examples. We can use these columns to explore the data using the Pandas API:</p>

In [120]:
# Group by 'age_group'; the grouped column becomes the index of the result
# Requires an aggregation function (here count()) to produce a result!
display(customers.groupby('age_group').count())

# as_index=False keeps 'age_group' as a regular column, so we can select just the columns we want to return
display(customers.groupby('age_group', as_index=False).count()[['customer_id', 'age_group']])

age_group,customer_id,renewal_hbr,term_in_years,customer_tenure,aug_id,ages,age_group_apply,loyalty_age_group
Adult,7,7,7,7,7,7,7,7
Senior,1,1,1,1,1,1,1,1
Young Adult,2,2,2,2,2,2,2,2


Unnamed: 0,customer_id,age_group
0,7,Adult
1,1,Senior
2,2,Young Adult
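
One detail worth knowing about `count()`: it reports the number of non-null values in each column, which is why every column above shows the same number. If you only want rows per group, `size()` is a closely related option. The sketch below uses a toy frame with one deliberately missing tenure value, which is hypothetical and not part of the course data.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'age_group': ['Adult', 'Adult', 'Senior', 'Young Adult'],
    'customer_tenure': [3.5, np.nan, 2.0, 1.0],  # NaN added for illustration
})

# count() skips the missing value, so Adult has 1 non-null tenure.
non_null = demo.groupby('age_group')['customer_tenure'].count()

# size() counts rows regardless of missing values, so Adult has 2 rows.
rows = demo.groupby('age_group').size()

print(non_null)
print(rows)
```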


##### Type of groupby()

<p>The result is a new dataframe in which each column holds the number of non-null values in that column for each group. Now notice the type of a grouped dataframe before any aggregation is applied:</p>

In [21]:
print(type(customers), '-> Regular DataFrame object')
print(type(customers.groupby('age_group')), '-> Groupby Object')

<class 'pandas.core.frame.DataFrame'> -> Regular DataFrame object
<class 'pandas.core.groupby.generic.DataFrameGroupBy'> -> Groupby Object

<p>Grouping alone returns a GroupBy object rather than a dataframe, because simply grouping data doesn't quite make sense without an aggregation function like count() to pair with. In this case, we're counting occurrences of the grouped field, but that's not all we can do. We can take averages, standard deviations, mins, maxes and much more! Let's see how this works a bit more:</p>
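
To see what a GroupBy object holds before any aggregation runs, you can iterate over it. This is a minimal sketch on a three-row toy frame; each iteration yields the group label and the sub-dataframe of rows belonging to that group.

```python
import pandas as pd

demo = pd.DataFrame({
    'age_group': ['Adult', 'Senior', 'Adult'],
    'customer_tenure': [3.5, 2.0, 1.5],
})

grouped = demo.groupby('age_group')
print(type(grouped).__name__)  # DataFrameGroupBy

# Nothing is aggregated yet; the object only records how the rows are split.
group_sizes = {}
for name, group in grouped:
    group_sizes[name] = len(group)
print(group_sizes)  # {'Adult': 2, 'Senior': 1}

# Pairing the groups with an aggregation produces a regular pandas result.
mean_tenure = grouped['customer_tenure'].mean()
print(mean_tenure)
```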

##### mean()

In [22]:
# Select the numeric column first, then aggregate, so text columns are never averaged
display(customers.groupby('ages')['customer_tenure'].mean())
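
Several summary statistics can be computed in one pass with `agg()`. The sketch below uses pandas named aggregation on a toy frame built from the first six rows of the customers table; the output column names `avg_tenure`, `max_tenure` and `n_customers` are labels chosen for this example.

```python
import pandas as pd

# First six rows of the customers table (only the columns needed here).
demo = pd.DataFrame({
    'customer_id': [1051, 1092, 1014, 1071, 1060, 1020],
    'age_group': ['Adult', 'Senior', 'Adult', 'Young Adult', 'Adult', 'Young Adult'],
    'customer_tenure': [3.5, 2.0, 1.5, 3.5, 3.5, 1.0],
})

# Each keyword is an output column: (input column, aggregation name).
summary = demo.groupby('age_group').agg(
    avg_tenure=('customer_tenure', 'mean'),
    max_tenure=('customer_tenure', 'max'),
    n_customers=('customer_id', 'count'),
)
print(summary)
```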

##### groupby() w/Multiple Columns

<p>This gives the average tenure for each distinct age. You can also group on several columns at once by passing a list of columns to group by:</p>

In [127]:
customers.groupby(['age_group', 'ages']).count().sort_values('ages', ascending=False)

age_group,ages,customer_id,renewal_hbr,term_in_years,customer_tenure,aug_id,age_group_apply,loyalty_age_group
Senior,69,1,1,1,1,1,1,1
Adult,60,1,1,1,1,1,1,1
Adult,56,2,2,2,2,2,2,2
Adult,46,1,1,1,1,1,1,1
Adult,40,1,1,1,1,1,1,1
Adult,38,1,1,1,1,1,1,1
Adult,36,1,1,1,1,1,1,1
Young Adult,32,1,1,1,1,1,1,1
Young Adult,25,1,1,1,1,1,1,1
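
Grouping on several columns produces a two-level index, which can be awkward to work with afterwards. One common follow-up, sketched here on a four-row toy frame, is to count rows with `size()` and flatten the result back into ordinary columns with `reset_index()`. The column name `n` is chosen for this example.

```python
import pandas as pd

demo = pd.DataFrame({
    'age_group': ['Adult', 'Adult', 'Adult', 'Senior'],
    'ages': [56, 56, 46, 69],
})

counts = (
    demo.groupby(['age_group', 'ages'])
        .size()                        # rows per (age_group, ages) pair
        .reset_index(name='n')         # back to flat columns
        .sort_values('ages', ascending=False)
)
print(counts)
```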


##### drop_duplicates()

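<p>Returns a new dataframe with duplicate rows removed. Passing a column name (or a list of column names) compares only those columns, and by default the first occurrence of each value is kept.</p>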

In [128]:
customers

Unnamed: 0,customer_id,renewal_hbr,term_in_years,customer_tenure,aug_id,ages,age_group_apply,age_group,loyalty_age_group
0,1051,7,0.5,3.5,1052,56,Adult,Adult,Loyal Adult
1,1092,4,0.5,2.0,1093,69,Senior,Senior,New Senior
2,1014,3,0.5,1.5,1015,46,Adult,Adult,New Adult
3,1071,7,0.5,3.5,1072,32,Young Adult,Young Adult,Loyal Young Adult
4,1060,7,0.5,3.5,1061,60,Adult,Adult,Loyal Adult
5,1020,2,0.5,1.0,1021,25,Young Adult,Young Adult,New Young Adult
6,1082,5,1.0,5.0,1083,38,Adult,Adult,Loyal Adult
7,1086,4,1.0,4.0,1087,56,Adult,Adult,Loyal Adult
8,1074,1,0.5,0.5,1075,36,Adult,Adult,New Adult
9,1074,7,0.5,3.5,1075,40,Adult,Adult,Loyal Adult


In [130]:
customer_copy = customers.drop_duplicates('renewal_hbr').reset_index(drop=True)
customer_copy

Unnamed: 0,customer_id,renewal_hbr,term_in_years,customer_tenure,aug_id,ages,age_group_apply,age_group,loyalty_age_group
0,1051,7,0.5,3.5,1052,56,Adult,Adult,Loyal Adult
1,1092,4,0.5,2.0,1093,69,Senior,Senior,New Senior
2,1014,3,0.5,1.5,1015,46,Adult,Adult,New Adult
3,1020,2,0.5,1.0,1021,25,Young Adult,Young Adult,New Young Adult
4,1082,5,1.0,5.0,1083,38,Adult,Adult,Loyal Adult
5,1074,1,0.5,0.5,1075,36,Adult,Adult,New Adult
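
Which row survives depends on the `keep` argument. The sketch below uses a three-row toy frame to show the three settings side by side; the values are taken from the table above, but the frame itself is a stand-in.

```python
import pandas as pd

demo = pd.DataFrame({
    'customer_id': [1051, 1092, 1071],
    'renewal_hbr': [7, 4, 7],
})

# Default keep='first': the first row with each renewal_hbr value survives.
first = demo.drop_duplicates('renewal_hbr')

# keep='last': the last row with each value survives instead.
last = demo.drop_duplicates('renewal_hbr', keep='last')

# keep=False: every row whose value is repeated is removed.
unique_only = demo.drop_duplicates('renewal_hbr', keep=False)

print(first['customer_id'].tolist())        # [1051, 1092]
print(last['customer_id'].tolist())         # [1092, 1071]
print(unique_only['customer_id'].tolist())  # [1092]
```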


In [133]:
# What if I wanted to save this to a file?
# index=False leaves the row index out of the file
customer_copy.to_csv('customers.csv', index=False)

# Read the file back in from the same (relative) path to confirm the round trip
pd.read_csv('customers.csv')

Unnamed: 0,customer_id,renewal_hbr,term_in_years,customer_tenure,aug_id,ages,age_group_apply,age_group,loyalty_age_group
0,1051,7,0.5,3.5,1052,56,Adult,Adult,Loyal Adult
1,1092,4,0.5,2.0,1093,69,Senior,Senior,New Senior
2,1014,3,0.5,1.5,1015,46,Adult,Adult,New Adult
3,1020,2,0.5,1.0,1021,25,Young Adult,Young Adult,New Young Adult
4,1082,5,1.0,5.0,1083,38,Adult,Adult,Loyal Adult
5,1074,1,0.5,0.5,1075,36,Adult,Adult,New Adult
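
The effect of `index=False` is easiest to see by writing the same frame both ways and reading each file back. This sketch writes to a temporary directory so nothing is left on disk; the file name `customers_demo.csv` is made up for the example.

```python
import os
import tempfile

import pandas as pd

demo = pd.DataFrame({'customer_id': [1051, 1092], 'renewal_hbr': [7, 4]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'customers_demo.csv')

    # Without the index, the file round-trips to the same columns.
    demo.to_csv(path, index=False)
    round_trip = pd.read_csv(path)

    # With the default index=True, the index comes back as an extra column.
    demo.to_csv(path)
    with_index = pd.read_csv(path)

print(list(round_trip.columns))  # ['customer_id', 'renewal_hbr']
print(list(with_index.columns))  # ['Unnamed: 0', 'customer_id', 'renewal_hbr']
```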


<p>Thus the groupby operation allows you to rapidly make summary observations about the state of your entire dataset at flexible granularity. In one line above, we actually did something very complicated - that's the power of the dataframe. In fact, the process often consists of several iterative groupby operations, each revealing greater insight than the last - if you don't know where to start with a dataset, try a bunch of groupbys!</p>

### Homework Exercise #1 - Find the Total Number of Home Runs and RBIs for the Red Sox <br>
<p>Get the total number of home runs and RBIs.</p>

In [16]:
import numpy as np
import pandas as pd
bs2017 = pd.read_csv(r'C:\Users\jmira\OneDrive\Coding Temple Cohort 0501 0623\week_3\redsox_2017_hitting.txt')
bs2018 = pd.read_csv(r'C:\Users\jmira\OneDrive\Coding Temple Cohort 0501 0623\week_3\redsox_2018_hitting.txt')

# step 1: Add a new column with the key 'Team' and all column values should be 'BOS'
bs2017['Team'] = 'BOS'
bs2018['Team'] = 'BOS'

# step 2: Group by the 'Team' column and get total home runs and RBIs
# (select the two numeric columns before summing so text columns are not aggregated)
a_17_18 = pd.concat([bs2017, bs2018])
a_1718 = a_17_18.groupby('Team')[['HR', 'RBI']].sum()
print(f"Total Sum of HR and RBI:\n{a_1718}\n")

# Produce data for both 2017 and 2018 (i.e. print both separated by a newline character \n)
p_2017 = bs2017.groupby('Team')[['HR', 'RBI']].sum()
p_2018 = bs2018.groupby('Team')[['HR', 'RBI']].sum()
print(f"2017 data:\n{p_2017}\n")
print(f"2018 data:\n{p_2018}\n")

"""
TEAM    HR   RBI
----------------
BOS     144  538
"""


Total Sum of HR and RBI:
       HR   RBI
Team           
BOS   376  1561

2017 data:
       HR  RBI
Team          
BOS   168  735

2018 data:
       HR  RBI
Team          
BOS   208  826
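
The per-season totals can also come from a single groupby by tagging each season before concatenating. This is a sketch on made-up numbers: the player names and statistics below are hypothetical placeholders, not the real Red Sox data, and the `Year` column is an addition for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the two season files.
s2017 = pd.DataFrame({'Name': ['A', 'B'], 'HR': [10, 5], 'RBI': [40, 20]})
s2018 = pd.DataFrame({'Name': ['A', 'C'], 'HR': [12, 8], 'RBI': [50, 30]})

# Tag each season, stack the rows, then group on the tag.
combined = pd.concat(
    [s2017.assign(Year=2017), s2018.assign(Year=2018)],
    ignore_index=True,
)
totals = combined.groupby('Year')[['HR', 'RBI']].sum()
print(totals)
```

Summing the `totals` rows again gives the combined figure for both seasons.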


