# Coding Temple's Data Analytics Course
---
## Advanced Python Day 4: Intro to Pandas

## Tasks Today:

0) <b>Pre-Work</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) Numpy Random Sampling

1) <b>Pandas</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) Importing <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) Tabular Data Structures <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - from_dict() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - read_csv() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) <b>In-Class Exercise #1</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) Accessing Data <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Indexing <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - df.loc <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - keys() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Slicing a DataFrame <br>
 &nbsp;&nbsp;&nbsp;&nbsp; e) Built-In Methods <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - head() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - tail() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - shape <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - describe() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - sort_values() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - .columns <br>
 &nbsp;&nbsp;&nbsp;&nbsp; f) <b>In-Class Exercise #2</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; g) Filtration <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Conditionals <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Subsetting <br>
 &nbsp;&nbsp;&nbsp;&nbsp; h) Column Transformations <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Generating a New Column w/Data <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - User Defined Function <br>
 &nbsp;&nbsp;&nbsp;&nbsp; i) Aggregations <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - groupby() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Type of groupby() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - mean() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - groupby() w/Multiple Columns <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - drop_duplicates() <br>

## Numpy Random Sampling

In [62]:
import numpy as np
np.random.seed(42)

# A single call generates a single random number
print(f"Here is a random number: {np.random.uniform()}")

# You can also pass in some bounds
print(f'Here is a random number between 0 and 1 million: {np.random.uniform(0,1e6)}')

# You can also generate a bunch of random numbers all at once
print(f'Here is a 3x3 matrix of random numbers between 0 and 1 Million:\n{np.random.uniform(0,1e6, (3,3))}')

# Instead of floats, let's generate random integers
print(f'Using Random Integers: {np.random.randint(0,10,4)}')

Here is a random number: 0.3745401188473625
Here is a random number between 0 and 1 million: 950714.3064099161
Here is a 3x3 matrix of random numbers between 0 and 1 Million:
[[731993.94181141 598658.48419704 156018.64044244]
 [155994.5203362   58083.6121682  866176.14577494]
 [601115.01174321 708072.57779605  20584.4942958 ]]
Using Random Integers: [1 7 5 1]
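
The calls above rely on NumPy's legacy global random state. As a minimal sketch, assuming a NumPy version that provides `np.random.default_rng` (1.17 or later), the same kind of sampling can be done with a `Generator` object that carries its own seed. The variable names here are illustrative only.

```python
import numpy as np

# Two generators built from the same seed produce identical streams
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

floats_a = rng_a.uniform(0, 1e6, (3, 3))   # 3x3 floats in [0, 1e6)
floats_b = rng_b.uniform(0, 1e6, (3, 3))

ints_a = rng_a.integers(0, 10, 4)          # 4 integers in [0, 10)
ints_b = rng_b.integers(0, 10, 4)

print(np.array_equal(floats_a, floats_b))  # same seed, same floats
print(np.array_equal(ints_a, ints_b))      # same seed, same integers
```

Keeping the generator in a variable, rather than seeding globally, makes it easier to see which part of a notebook depends on which seed.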


## Pandas <br>

<p>Pandas is a flexible data analysis library built on top of NumPy that excels at working with tabular data. It is currently the de facto standard for Python-based data analysis, and fluency in Pandas will do wonders for your productivity and, frankly, your resume. It is one of the fastest ways to get from raw data to an answer.</p>

<ul>
    <li>Pandas is a Python package whose performance-critical parts are written in Cython and C. It is a high-performance, high-level data analysis library that lets us work with large tabular datasets through objects called DataFrames.</li>
    <li>Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.)</li>
    <li>Dataframe = Spreadsheet (has column headers, index, etc.)</li>
</ul>
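
To make the Series/DataFrame distinction concrete, here is a small sketch. The names, ages, and cities are made up for illustration and are not part of the course data.

```python
import pandas as pd

# A Series: one-dimensional, every value carries a label from the index
ages = pd.Series([18, 29, 34], index=['Alice', 'Bob', 'Beth'], name='ages')

# A DataFrame: two-dimensional, each column is itself a Series
people = pd.DataFrame({'ages': ages, 'city': ['Boston', 'Chicago', 'Denver']})

print(ages['Bob'])            # look up a value by its label
print(type(people['ages']))   # selecting one column gives back a Series
print(people.shape)           # (rows, columns)
```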

### Importing

In [12]:
# always use pd, standard for data science
import pandas as pd

### Tabular data structures <br>
<p>The central object of study in Pandas is the DataFrame, a tabular data structure with rows and columns, like an Excel spreadsheet. We start by creating dataframes from native Python dictionaries and from text files through the Pandas I/O system.</p>

In [13]:
names = ['Alice',
         'Bob',
         'James',
         'Beth',
         'John',
         'Sally',
         'Richard',
         'Lauren',
         'Brandon',
         'Sabrina']
ages = np.random.randint(18,35, len(names))

my_people = {
    'names' : names,
    'ages' : ages
}

my_people

{'names': ['Alice',
  'Bob',
  'James',
  'Beth',
  'John',
  'Sally',
  'Richard',
  'Lauren',
  'Brandon',
  'Sabrina'],
 'ages': array([18, 29, 29, 34, 27, 33, 32, 32, 29, 20])}

##### from_dict()

<p>Let's convert our not-so-useful-for-analysis dict into a Pandas dataframe. We can use the from_dict function to do this easily using Pandas:</p>

In [14]:
data = pd.DataFrame.from_dict(my_people)
data

Unnamed: 0,names,ages
0,Alice,18
1,Bob,29
2,James,29
3,Beth,34
4,John,27
5,Sally,33
6,Richard,32
7,Lauren,32
8,Brandon,29
9,Sabrina,20
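
In the example above, the dict keys became column names. As a hedged sketch of the other layout, `from_dict` also accepts `orient='index'`, which treats each key as a row label instead. The `records` dict below is hypothetical.

```python
import pandas as pd

# Hypothetical records keyed by person: each key should become a row
records = {
    'Alice': [18, 'Boston'],
    'Bob': [29, 'Chicago'],
}

# orient='index' treats the dict keys as row labels rather than column names
by_row = pd.DataFrame.from_dict(records, orient='index', columns=['age', 'city'])
print(by_row)
```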


##### read_csv()

In [16]:
# pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)

marathon = pd.read_csv(r'C:\Users\Alex Lucchesi\coding-temple\coding_temple_data_analytics_ft\week-3\data\boston_marathon_2017.csv')
marathon

Unnamed: 0,10K,15K,20K,25K,30K,35K,40K,5K,Age,Bib,...,Division,Gender,Half,M/F,Name,Number of Records,Official Time,Overall,Pace,State
0,12/30/1899 12:30:28 AM,12/30/1899 12:45:44 AM,12/30/1899 1:01:15 AM,12/30/1899 1:16:59 AM,12/30/1899 1:33:01 AM,12/30/1899 1:48:19 AM,12/30/1899 2:02:53 AM,12/30/1899 12:15:25 AM,24,11.0,...,1,1,12/30/1899 1:04:35 AM,M,"Kirui, Geoffrey",1,12/30/1899 2:09:37 AM,1,12/30/1899 12:04:57 AM,
1,12/30/1899 12:30:27 AM,12/30/1899 12:45:44 AM,12/30/1899 1:01:15 AM,12/30/1899 1:16:59 AM,12/30/1899 1:33:01 AM,12/30/1899 1:48:19 AM,12/30/1899 2:03:14 AM,12/30/1899 12:15:24 AM,30,17.0,...,2,2,12/30/1899 1:04:35 AM,M,"Rupp, Galen",1,12/30/1899 2:09:58 AM,2,12/30/1899 12:04:58 AM,OR
2,12/30/1899 12:30:29 AM,12/30/1899 12:45:44 AM,12/30/1899 1:01:16 AM,12/30/1899 1:17:00 AM,12/30/1899 1:33:01 AM,12/30/1899 1:48:31 AM,12/30/1899 2:03:38 AM,12/30/1899 12:15:25 AM,25,23.0,...,3,3,12/30/1899 1:04:36 AM,M,"Osako, Suguru",1,12/30/1899 2:10:28 AM,3,12/30/1899 12:04:59 AM,
3,12/30/1899 12:30:29 AM,12/30/1899 12:45:44 AM,12/30/1899 1:01:19 AM,12/30/1899 1:17:00 AM,12/30/1899 1:33:01 AM,12/30/1899 1:48:58 AM,12/30/1899 2:04:35 AM,12/30/1899 12:15:25 AM,32,21.0,...,4,4,12/30/1899 1:04:45 AM,M,"Biwott, Shadrack",1,12/30/1899 2:12:08 AM,4,12/30/1899 12:05:03 AM,CA
4,12/30/1899 12:30:28 AM,12/30/1899 12:45:44 AM,12/30/1899 1:01:15 AM,12/30/1899 1:16:59 AM,12/30/1899 1:33:01 AM,12/30/1899 1:48:41 AM,12/30/1899 2:05:00 AM,12/30/1899 12:15:25 AM,31,9.0,...,5,5,12/30/1899 1:04:35 AM,M,"Chebet, Wilson",1,12/30/1899 2:12:35 AM,5,12/30/1899 12:05:04 AM,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26405,12/30/1899 1:35:41 AM,12/30/1899 2:23:35 AM,12/30/1899 3:12:44 AM,12/30/1899 4:12:06 AM,12/30/1899 5:03:08 AM,12/30/1899 5:55:18 AM,12/30/1899 6:46:57 AM,12/30/1899 12:46:44 AM,61,25166.0,...,344,11972,12/30/1899 3:23:31 AM,F,"Steinbach, Paula Eyvonne",1,12/30/1899 7:09:39 AM,26407,12/30/1899 12:16:24 AM,CA
26406,12/30/1899 1:05:33 AM,12/30/1899 1:52:17 AM,12/30/1899 2:49:41 AM,12/30/1899 3:50:19 AM,12/30/1899 4:50:01 AM,12/30/1899 5:53:48 AM,12/30/1899 6:54:21 AM,12/30/1899 12:32:03 AM,25,25178.0,...,4774,14436,12/30/1899 3:00:26 AM,M,"Avelino, Andrew R.",1,12/30/1899 7:16:59 AM,26408,12/30/1899 12:16:40 AM,NC
26407,12/30/1899 1:43:36 AM,12/30/1899 2:32:36 AM,,12/30/1899 4:15:21 AM,12/30/1899 5:06:37 AM,12/30/1899 6:00:33 AM,12/30/1899 6:54:38 AM,12/30/1899 12:53:11 AM,57,27086.0,...,698,11973,12/30/1899 3:36:24 AM,F,"Hantel, Johanna",1,12/30/1899 7:19:37 AM,26409,12/30/1899 12:16:47 AM,PA
26408,12/30/1899 1:27:19 AM,12/30/1899 2:17:17 AM,12/30/1899 3:11:40 AM,12/30/1899 4:06:10 AM,12/30/1899 5:07:09 AM,12/30/1899 6:06:07 AM,12/30/1899 6:56:08 AM,12/30/1899 12:40:34 AM,64,25268.0,...,1043,14437,12/30/1899 3:22:30 AM,M,"Reilly, Bill",1,12/30/1899 7:20:44 AM,26410,12/30/1899 12:16:49 AM,NY
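
The marathon file lives at a path specific to one machine, so here is a self-contained sketch of `read_csv` that reads from an in-memory string instead. The CSV text and runner names are invented for illustration. `usecols` is a real `read_csv` parameter that limits which columns are loaded.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk
csv_text = """Name,Age,State,Bib
"Doe, Jane",24,MA,11
"Roe, Sam",30,OR,17
"Poe, Alex",25,,23
"""

# usecols keeps only the listed columns; empty fields are read as NaN
runners = pd.read_csv(io.StringIO(csv_text), usecols=['Name', 'Age', 'State'])
print(runners)
print(runners['State'].isnull().sum())   # number of missing states
```

Quoted fields such as "Doe, Jane" keep their embedded comma, which is how the Name column in the marathon data can hold "Last, First" values.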


### In-Class Exercise #1 - Read in Boston Red Sox Hitting Data <br>
<p>Use the pandas read_csv() function to read in the statistics from yesterday's two files.</p>

In [19]:
sox17 = pd.read_csv(r'C:\Users\Alex Lucchesi\coding-temple\coding_temple_data_analytics_ft\week-3\data\redsox_2017_hitting.txt')
sox18 = pd.read_csv(r'C:\Users\Alex Lucchesi\coding-temple\coding_temple_data_analytics_ft\week-3\data\redsox_2018_hitting.txt')

### Accessing Data <br>

##### Indexing

<p>You can directly select a column of a dataframe just like you would a dict. The result is a Pandas 'Series' object.</p>

In [20]:
data_ages = data['ages']
print(data_ages)

# Let's look at the type of the data!

# Series
print(type(data_ages))

# DataFrame
print(type(data))

# Series indexing - the default index labels start at 0
print(data_ages[0])
print(data_ages[3])

# DataFrame indexing - same default labels
# Point to the dataframe, then the column, then the row label
print(data['ages'][0])
print(data['ages'][3])

# We can change the types too!
print(float(data_ages[0]))


0    18
1    29
2    29
3    34
4    27
5    33
6    32
7    32
8    29
9    20
Name: ages, dtype: int32
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
18
34
18
34
18.0


##### df.loc

Each row of a Pandas DataFrame can be pulled out as a Series. You will notice an extra, unlabeled column on the left of the DataFrame: this is the $\textit{index}$. It is generated automatically as a row number, but you can replace it with a column of your choice using the DataFrame.set_index(colname) method. df.loc looks rows up by their index label, and a label slice such as 0:2 includes both endpoints. We can use it to access particular Pandas $\textit{rows}$:

In [32]:
print(data)

# Grab the first row of data --by the index of the row
my_row0 = data.loc[0]
my_row3 = data.loc[3]
print(f'\nLocation of the 0th row in the data:\n{my_row0}')
print(f'\nLocation of the 3rd row in the data:\n{my_row3}')

# Getting multiple values from df.loc
multi_val = data.loc[[0,1,2]]['names']
multi_val_slice = data.loc[0:2]['names']
print(f'\nLocation of multiple people:\n{multi_val}')
print(f'\nLocation of multiple people w/ a slice:\n{multi_val_slice}')

# Using df.loc and returning a dataframe object
multi_val_df = data.loc[[0,1,2]][['names', 'ages']]
multi_val_df_slice = data.loc[0:2][:]
print(f'\nLocation of multiple people df:\n{multi_val_df}')
print(f'\nLocation of multiple people df sliced:\n{multi_val_df_slice}')

# Using df.loc to set a user-specified index
user_index = data.loc[[0,1,2,3]].set_index('names')
user_index_slice = data.loc[0:3].set_index('names')
print(f'\nThe new index is: \n{user_index.index}')
print(f'\nThe new index when using a slice is: \n{user_index_slice.index}')

     names  ages
0    Alice    18
1      Bob    29
2    James    29
3     Beth    34
4     John    27
5    Sally    33
6  Richard    32
7   Lauren    32
8  Brandon    29
9  Sabrina    20

Location of the 0th row in the data:
names    Alice
ages        18
Name: 0, dtype: object

Location of the 3rd row in the data:
names    Beth
ages       34
Name: 3, dtype: object

Location of multiple people:
0    Alice
1      Bob
2    James
Name: names, dtype: object

Location of multiple people w/ a slice:
0    Alice
1      Bob
2    James
Name: names, dtype: object

Location of multiple people df:
   names  ages
0  Alice    18
1    Bob    29
2  James    29

Location of multiple people df sliced:
   names  ages
0  Alice    18
1    Bob    29
2  James    29

The new index is: 
Index(['Alice', 'Bob', 'James', 'Beth'], dtype='object', name='names')

The new index when using a slice is: 
Index(['Alice', 'Bob', 'James', 'Beth'], dtype='object', name='names')
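
With the default integer index, labels and positions coincide, which hides the difference between label-based and position-based access. The sketch below uses a small made-up frame with names as the index to contrast `df.loc` with its positional counterpart `df.iloc`.

```python
import pandas as pd

people = pd.DataFrame(
    {'ages': [18, 29, 29, 34]},
    index=['Alice', 'Bob', 'James', 'Beth'],
)

by_label = people.loc['Bob':'Beth']   # label slice: both endpoints included
by_position = people.iloc[1:3]        # position slice: stop is excluded

print(by_label)
print(by_position)
```

The label slice returns three rows (Bob, James, Beth), while the position slice returns two (Bob, James), just like slicing a Python list.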


##### keys()

In [36]:
# Access all of the keys/columns of the dataframe
# Dataframe.keys()
print(data.keys())
print(type(sox17.keys()))
print(type(sox17.keys().to_list()))

Index(['names', 'ages'], dtype='object')
<class 'pandas.core.indexes.base.Index'>
<class 'list'>


##### Slicing a DataFrame

In [51]:
# printing all data for context
print(f'\n{data}')
print(f'\n{data[:]}')
print(f'\n{data[:1]}')
print(f'\n{data[5:]}')
print(f'\n{data[2:5]}')


     names  ages
0    Alice    18
1      Bob    29
2    James    29
3     Beth    34
4     John    27
5    Sally    33
6  Richard    32
7   Lauren    32
8  Brandon    29
9  Sabrina    20

     names  ages
0    Alice    18
1      Bob    29
2    James    29
3     Beth    34
4     John    27
5    Sally    33
6  Richard    32
7   Lauren    32
8  Brandon    29
9  Sabrina    20

   names  ages
0  Alice    18

     names  ages
5    Sally    33
6  Richard    32
7   Lauren    32
8  Brandon    29
9  Sabrina    20

   names  ages
2  James    29
3   Beth    34
4   John    27


### Built-In Methods <br>

<p>These are methods you will use frequently to make your life easier when working with Pandas. It is possible to spend a whole week simply exploring the built-in functions supported by DataFrames. Here, however, we will highlight just a few useful ones to give you an idea of what's possible out of the box:</p>

##### .head()

In [52]:
# DataFrame.head(n) -- n is the number of rows to return from the top (default 5)
sox17.head()

Unnamed: 0,Rk,Pos,Name,Age,G,PA,AB,R,H,2B,...,OBP,SLG,OPS,OPS+,TB,GDP,HBP,SH,SF,IBB
0,1,C,Christian Vazquez,26,99,345,324,43,94,18,...,0.33,0.404,0.735,91,131,14,3,0,1,0
1,2,1B,Mitch Moreland,31,149,576,508,73,125,34,...,0.326,0.443,0.769,99,225,14,6,0,5,6
2,3,2B,Dustin Pedroia,33,105,463,406,46,119,19,...,0.369,0.392,0.76,100,159,11,2,2,4,4
3,4,SS,Xander Bogaerts,24,148,635,571,94,156,32,...,0.343,0.403,0.746,95,230,17,6,0,2,6
4,5,3B,Rafael Devers,20,58,240,222,34,63,14,...,0.338,0.482,0.819,111,107,5,0,0,0,3


##### .tail()

In [54]:
# DataFrame.tail(n) -- n is the number of rows to return from the bottom (default 5)
sox17.tail()

Unnamed: 0,Rk,Pos,Name,Age,G,PA,AB,R,H,2B,...,OBP,SLG,OPS,OPS+,TB,GDP,HBP,SH,SF,IBB
18,19,IF,Marco Hernandez,24,21,60,58,7,16,3,...,0.3,0.328,0.628,65,19,0,1,0,0,0
19,20,UT,Rajai Davis,36,17,38,36,7,9,2,...,0.289,0.306,0.595,56,11,2,1,0,0,0
20,21,UT,Steve Selsky,27,8,9,9,0,1,1,...,0.111,0.222,0.333,-16,2,0,0,0,0,0
21,22,UT,Blake Swihart,25,6,7,5,1,1,0,...,0.429,0.2,0.629,74,1,0,0,0,0,0
22,23,2B,Chase d'Arnaud,30,2,1,1,2,1,0,...,1.0,1.0,2.0,428,1,0,0,0,0,0


##### .describe()
Probably one of the most important methods to understand. .describe() collects the main summary statistics in one dataframe object for easy viewing: the count of values, mean, standard deviation, minimum value, maximum value, and the quartiles (25th, 50th, and 75th percentiles).

In [59]:
# DataFrame.describe() -- Accepts parameters (include, exclude)
# help(sox17.describe)
sox17.describe()

Unnamed: 0,Rk,Age,G,PA,AB,R,H,2B,3B,HR,...,OBP,SLG,OPS,OPS+,TB,GDP,HBP,SH,SF,IBB
count,23.0,23.0,23.0,23.0,23.0,23.0,23.0,23.0,23.0,23.0,...,23.0,23.0,23.0,23.0,23.0,23.0,23.0,23.0,23.0,23.0
mean,12.0,27.478261,72.086957,274.565217,245.521739,34.130435,63.434783,13.086957,0.826087,7.304348,...,0.346217,0.393348,0.739609,93.521739,100.086957,6.130435,2.304348,0.347826,1.565217,2.086957
std,6.78233,4.110624,52.747178,236.592059,209.032157,30.804817,55.645653,12.630817,1.466355,8.309003,...,0.153742,0.156269,0.298278,78.686299,91.930267,5.786382,2.566122,0.775107,2.149547,3.073539
min,1.0,20.0,2.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.111,0.2,0.333,-16.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,6.5,24.0,28.5,74.5,67.0,8.5,18.0,2.0,0.0,0.0,...,0.2985,0.3305,0.625,63.0,22.5,1.5,0.0,0.0,0.0,0.0
50%,12.0,27.0,64.0,188.0,171.0,30.0,53.0,12.0,0.0,5.0,...,0.325,0.387,0.709,88.0,89.0,4.0,2.0,0.0,1.0,0.0
75%,17.5,30.0,119.0,502.0,444.0,52.0,118.5,19.0,1.5,10.0,...,0.348,0.4265,0.7645,99.5,176.5,10.0,3.0,0.0,2.0,4.0
max,23.0,36.0,153.0,712.0,628.0,101.0,166.0,46.0,6.0,24.0,...,1.0,1.0,2.0,428.0,288.0,17.0,9.0,3.0,8.0,9.0


In [60]:
# .describe has an argument called exclude where we can specify the type of data we want to exclude.
sox17.describe(exclude='number')

Unnamed: 0,Pos,Name
count,23,23
unique,11,23
top,UT,Christian Vazquez
freq,6,1
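
As a small sketch of how the `describe()` parameters change the result, the frame below uses invented position and home-run values. `percentiles` is a real parameter that controls which percentile rows are reported.

```python
import pandas as pd

stats = pd.DataFrame({
    'Pos': ['C', '1B', 'UT', 'UT'],
    'HR': [5, 22, 0, 1],
})

numeric_summary = stats.describe()                 # numeric columns only (default)
text_summary = stats.describe(exclude='number')    # everything that is not numeric
custom = stats.describe(percentiles=[0.1, 0.9])    # ask for the 10th and 90th percentiles

print(numeric_summary.loc['mean', 'HR'])
print(text_summary.loc['top', 'Pos'])
print(list(custom.index))
```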


##### .sort_values()

In [71]:
# Sort by a single column; pass a list of columns to sort with left-to-right priority
# DataFrame.sort_values('key')
sorted_data_names = data.sort_values('names', kind = 'mergesort')
sorted_data_ages = data.sort_values('ages', kind='mergesort')

# reset_index() returns a new DataFrame; the results are not assigned here,
# so the frames displayed below keep their original index labels
sorted_data_names.reset_index(drop=True)
sorted_data_ages.reset_index(drop=True)

display(sorted_data_names)
display(sorted_data_ages)



Unnamed: 0,names,ages
0,Alice,18
3,Beth,34
1,Bob,29
8,Brandon,29
2,James,29
4,John,27
7,Lauren,32
6,Richard,32
9,Sabrina,20
5,Sally,33


Unnamed: 0,names,ages
0,Alice,18
9,Sabrina,20
4,John,27
1,Bob,29
2,James,29
8,Brandon,29
6,Richard,32
7,Lauren,32
5,Sally,33
3,Beth,34
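
Sorting on several columns at once can be sketched as follows, using a made-up subset of the names and ages. The `ascending` list gives one direction per sort column, and the result of `reset_index` is assigned so that the new index is kept.

```python
import pandas as pd

people = pd.DataFrame({
    'names': ['Bob', 'James', 'Alice', 'Brandon'],
    'ages': [29, 29, 18, 29],
})

# Oldest first; ties on age are broken alphabetically by name
ranked = people.sort_values(['ages', 'names'], ascending=[False, True])

# reset_index returns a new DataFrame, so the result must be assigned
ranked = ranked.reset_index(drop=True)
print(ranked)
```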


##### .isnull()

This method returns a boolean DataFrame of the same shape, with True wherever a value is missing (NaN/null) and False everywhere else.

In [72]:
sox17.isnull()

Unnamed: 0,Rk,Pos,Name,Age,G,PA,AB,R,H,2B,...,OBP,SLG,OPS,OPS+,TB,GDP,HBP,SH,SF,IBB
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
6,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [73]:
# Chaining .sum() adds up the True values down each column (axis=0), giving the number of nulls per column
sox17.isnull().sum()

Rk      0
Pos     0
Name    0
Age     0
G       0
PA      0
AB      0
R       0
H       0
2B      0
3B      0
HR      0
RBI     0
SB      0
CS      0
BB      0
SO      0
BA      0
OBP     0
SLG     0
OPS     0
OPS+    0
TB      0
GDP     0
HBP     0
SH      0
SF      0
IBB     0
dtype: int64
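
The Red Sox data happens to have no missing values, so here is a sketch with a small invented frame that does. It shows `isnull().sum()` alongside two common follow-ups, `dropna()` and `fillna()`.

```python
import numpy as np
import pandas as pd

splits = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    '20K': [61.2, np.nan, 63.0],
    'State': ['OR', None, 'CA'],
})

missing_per_column = splits.isnull().sum()    # count of missing values in each column
complete_rows = splits.dropna()               # drop any row with a missing value
filled = splits.fillna({'State': 'unknown'})  # fill one column, leave others alone

print(missing_per_column)
print(len(complete_rows))
print(filled['State'].tolist())
```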

##### .nunique()

In [76]:
# Counts the distinct values in each column of a dataframe (or in a single Series).
print(sox17.nunique())
print(sox17['Rk'].nunique())


Rk      23
Pos     11
Name    23
Age     13
G       22
PA      23
AB      23
R       18
H       21
2B      15
3B       5
HR      11
RBI     16
SB      12
CS       5
BB      19
SO      22
BA      22
OBP     22
SLG     22
OPS     22
OPS+    23
TB      21
GDP     13
HBP      6
SH       4
SF       7
IBB      8
dtype: int64
23


##### .info()

In [78]:
# .info provides useful information on each column present in a dataframe including
# Number of non-null values, the data type of the column and the number of rows/columns
sox17.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 28 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Rk      23 non-null     int64  
 1   Pos     23 non-null     object 
 2   Name    23 non-null     object 
 3   Age     23 non-null     int64  
 4   G       23 non-null     int64  
 5   PA      23 non-null     int64  
 6   AB      23 non-null     int64  
 7   R       23 non-null     int64  
 8   H       23 non-null     int64  
 9   2B      23 non-null     int64  
 10  3B      23 non-null     int64  
 11  HR      23 non-null     int64  
 12  RBI     23 non-null     int64  
 13  SB      23 non-null     int64  
 14  CS      23 non-null     int64  
 15  BB      23 non-null     int64  
 16  SO      23 non-null     int64  
 17  BA      23 non-null     float64
 18  OBP     23 non-null     float64
 19  SLG     23 non-null     float64
 20  OPS     23 non-null     float64
 21  OPS+    23 non-null     int64  
 22  TB  

##### .shape

In [57]:
# The dataframe has a shape attribute, just like a NumPy array: (rows, columns).
# print(df.shape) -- DataFrame.shape -- an attribute, so no parentheses or parameters
sox17.shape

(23, 28)

##### .columns

In [82]:
# will show all cols headers
# DataFrame.columns -- has no parameters
print(sox17.columns)
print(f'\n The type of the .columns attribute is: {type(sox17.columns)}\n')

# For a DataFrame, .keys() returns the same column labels as .columns
print(sox17.keys())
print(f'\nThe type the .keys() method returns is: {type(sox17.keys())}')

Index(['Rk', 'Pos', 'Name', 'Age', 'G', 'PA', 'AB', 'R', 'H', '2B', '3B', 'HR',
       'RBI', 'SB', 'CS', 'BB', 'SO', 'BA', 'OBP', 'SLG', 'OPS', 'OPS+', 'TB',
       'GDP', 'HBP', 'SH', 'SF', 'IBB'],
      dtype='object')

 The type of the .columns attribute is: <class 'pandas.core.indexes.base.Index'>

Index(['Rk', 'Pos', 'Name', 'Age', 'G', 'PA', 'AB', 'R', 'H', '2B', '3B', 'HR',
       'RBI', 'SB', 'CS', 'BB', 'SO', 'BA', 'OBP', 'SLG', 'OPS', 'OPS+', 'TB',
       'GDP', 'HBP', 'SH', 'SF', 'IBB'],
      dtype='object')

The type the .keys() method returns is: <class 'pandas.core.indexes.base.Index'>


### In-Class Exercise #2 - Describe & Sort Boston Red Sox Hitting Data <br>
<p>Take the data that you read in earlier from the Red Sox CSV files and use the describe method to understand the data better. Compare the two years and decide which team had the better year. Then sort the values by batting average.</p>

In [61]:
import pandas as pd
sox17 = pd.read_csv(r"C:\Users\narim\Downloads\redsox_2017_hitting.txt")
sox18 = pd.read_csv(r"C:\Users\narim\Downloads\redsox_2018_hitting.txt")

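# Column labels are case-sensitive: sox17['age'] raises a KeyError, the column is 'Age'
display(sox17.describe())
display(sox18.describe())

# Sort each season by batting average, best hitters first
display(sox17.sort_values('BA', ascending=False))
display(sox18.sort_values('BA', ascending=False))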

### Filtration <br>
<p>Let's look at how to filter dataframes for rows that fulfill a specific condition.</p>

##### Conditionals

In [84]:
# Boolean Series: True where the condition holds
condition = data['ages'] >= 25
data['ages'] >= 25

0    False
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9    False
Name: ages, dtype: bool

##### Subsetting

In [87]:
# exactly like numpy
display(data[condition])

# Same as:
data[data['ages'] >=25]

Unnamed: 0,names,ages
1,Bob,29
2,James,29
3,Beth,34
4,John,27
5,Sally,33
6,Richard,32
7,Lauren,32
8,Brandon,29


Unnamed: 0,names,ages
1,Bob,29
2,James,29
3,Beth,34
4,John,27
5,Sally,33
6,Richard,32
7,Lauren,32
8,Brandon,29
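
Filters often need more than one condition. The sketch below, on a made-up subset of the people data, combines boolean Series with `&` and uses `isin()` for membership tests.

```python
import pandas as pd

people = pd.DataFrame({
    'names': ['Alice', 'Bob', 'Beth', 'Sabrina'],
    'ages': [18, 29, 34, 20],
})

# Wrap each comparison in parentheses; use & for "and", | for "or", ~ for "not"
in_twenties = people[(people['ages'] >= 20) & (people['ages'] < 30)]

# isin() tests membership in a list of values
selected = people[people['names'].isin(['Alice', 'Beth'])]

print(in_twenties)
print(selected)
```

The parentheses matter: `&` binds more tightly than `>=` in Python, so leaving them out changes what is being compared.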


### Column Transformations <br>
<p>Rarely, if ever, will the columns in the original raw dataframe read from CSV or database table be the ones you actually need for your analysis. You will spend lots of time constantly transforming columns or groups of columns using general computational operations to produce new ones that are functions of the old ones. Pandas has full support for this: Consider the following dataframe containing membership term and renewal number for a group of customers:</p>

In [104]:
# Generate some fake data
np.random.seed(42) # Random seed to keep data the same
customer_id = np.random.randint(1000,1100,10)
renewal_nbr = np.random.randint(0,10,10)
customer_dict = {1: 0.5, 0: 1}
terms_in_years = [customer_dict[key] for key in np.random.randint(0,2,10)]

random_data = {
    'customer_id': customer_id,
    'renewal_nbr': renewal_nbr,
    'term_in_years': terms_in_years
}

customers = pd.DataFrame.from_dict(random_data)
customers

Unnamed: 0,customer_id,renewal_nbr,term_in_years
0,1051,7,0.5
1,1092,4,0.5
2,1014,3,0.5
3,1071,7,0.5
4,1060,7,0.5
5,1020,2,0.5
6,1082,5,1.0
7,1086,4,1.0
8,1074,1,0.5
9,1074,7,0.5


##### Feature Engineering a New Column w/Data

In [105]:
# DataFrame['key'] = Some Calculation from our DataFrame Columns
customers['customer_teunre'] = customers['renewal_nbr'] * customers['term_in_years']

customers

Unnamed: 0,customer_id,renewal_nbr,term_in_years,customer_teunre
0,1051,7,0.5,3.5
1,1092,4,0.5,2.0
2,1014,3,0.5,1.5
3,1071,7,0.5,3.5
4,1060,7,0.5,3.5
5,1020,2,0.5,1.0
6,1082,5,1.0,5.0
7,1086,4,1.0,4.0
8,1074,1,0.5,0.5
9,1074,7,0.5,3.5


#### Dropping a column

In [56]:
# Create a duplicate column
customers['customer_id_1'] = customers['customer_id']
display(customers)

# Let's try to drop it!
# This call raises a KeyError: by default drop() looks for the label among the row labels (axis=0)
customers.drop('customer_id_1')

##### Axis? What's That all about?
##### This: 
![atext](https://i.stack.imgur.com/dcoE3.jpg)  
If you check the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html), you'll see that DataFrame.drop() has "axis = 0" as a default, which means it searches the row labels. We need to explicitly (remember the Zen of Python?) tell pandas to look for the label along the column axis, which is axis 1.

In [54]:
# Now that we are armed with this knowledge, let's try again!
customers.drop('customer_id_1', axis=1)

#Looks like that worked! Let's double check
customers

# The column is still there! How can we fix this?
#Preferred method; saves the dataframe object without that column to a new dataframe
df = customers.drop('customer_id_1', axis=1)

#Another method, overwrites original dataframe object
customers.drop('customer_id_1', axis=1, inplace=True)

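
As an alternative to remembering axis numbers, `DataFrame.drop()` also accepts `columns=` and `index=` keywords. A minimal sketch on a hypothetical two-row frame (`customers_demo` is not part of the lesson data):

```python
import pandas as pd

customers_demo = pd.DataFrame({
    'customer_id': [1051, 1092],
    'customer_id_1': [1051, 1092],
    'renewal_nbr': [7, 4],
})

# columns= names the axis directly, so no axis argument is needed
without_copy = customers_demo.drop(columns='customer_id_1')

# index= (or the default axis=0) removes rows by label instead
without_first_row = customers_demo.drop(index=0)

print(list(without_copy.columns))
print(list(without_first_row.index))
print(list(customers_demo.columns))   # the original is unchanged
```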

#### Renaming a column

In [108]:
# Looks like I mis-named a column earlier! Let's fix that
customers.rename(columns = {'customer_teunre': 'customer_tenure'}, inplace=True)
customers.head()

Unnamed: 0,customer_id,renewal_nbr,term_in_years,customer_tenure
0,1051,7,0.5,3.5
1,1092,4,0.5,2.0
2,1014,3,0.5,1.5
3,1071,7,0.5,3.5
4,1060,7,0.5,3.5


##### User Defined Function

If what you want to do to a column can't be represented by simple mathematical operations, you can write your own $\textit{user defined function}$, with the full flexibility of Python and any external Python packages, and then map it directly onto a column. Let's add some ages to our customer dataframe and then classify them with our own custom grouping scheme:

In [122]:
# use .apply to map over dataframe
# Create a new column for ages
np.random.seed(42)
customers['ages'] = np.random.randint(18,70,10)

# User defined function
def make_age_groups(age):
    if 10 <= age < 20:
        return 'Teenager'
    elif age <35:
        return 'Young Adult'
    elif age < 65 :
        return 'Adult'
    else:
        return 'Senior'

# Two equivalent ways to run the function on every value in the column
customers['age_group'] = [make_age_groups(age) for age in customers['ages']]
customers['age_group_apply'] = customers['ages'].apply(make_age_groups)
customers


Unnamed: 0,customer_id,renewal_nbr,term_in_years,customer_tenure,ages,age_group,age_group_apply
0,1051,7,0.5,3.5,56,Adult,Adult
1,1092,4,0.5,2.0,69,Senior,Senior
2,1014,3,0.5,1.5,46,Adult,Adult
3,1071,7,0.5,3.5,32,Young Adult,Young Adult
4,1060,7,0.5,3.5,60,Adult,Adult
5,1020,2,0.5,1.0,25,Young Adult,Young Adult
6,1082,5,1.0,5.0,38,Adult,Adult
7,1086,4,1.0,4.0,56,Adult,Adult
8,1074,1,0.5,0.5,36,Adult,Adult
9,1074,7,0.5,3.5,40,Adult,Adult


As a last example, here is how you would use apply with a UDF (user defined function) that depends on $\textit{more than one}$ column:

In [50]:
# axis for apply can only be 0 or 1 -- axis=1 passes each row to the function, axis=0 passes each column
def make_loyalty_age_group(row):
    age = row['ages']
    tenure = row['customer_tenure']

    if 10 <= age < 20:
        age_group = 'Teenager'
    elif age <35:
        age_group = 'Young Adult'
    elif age < 65 :
        age_group = 'Adult'
    else:
        age_group = 'Senior'
    
    if tenure > 2.0:
        loyalty_age_group = f'Loyal {age_group}'
    else:
        loyalty_age_group = f'New {age_group}'
    
    return loyalty_age_group
    
customers['loyalty_age_group'] = customers.apply(make_loyalty_age_group, axis=1)
customers


### In-Class Exercise #3 - Create Your Own UDF <br>
<p>Using the Boston Red Sox data, write your own UDF that creates a new column called 'All-Star'. The column should contain 'Yes' for every player with either a batting average over .280 or an on-base percentage over .360, and 'No' otherwise.</p>

In [70]:
import pandas as pd
import numpy as np
sox17 = pd.read_csv(r"C:\Users\narim\Downloads\redsox_2017_hitting.txt")

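def add_all_stars(row):
    """Return 'Yes' if BA is over .280 or OBP is over .360, otherwise 'No'."""
    if row['BA'] > 0.280 or row['OBP'] > 0.360:
        return 'Yes'
    return 'No'

sox17['All-Star'] = sox17.apply(add_all_stars, axis=1)
sox17[['Name', 'BA', 'OBP', 'All-Star']]

# Output layout (values are placeholders):
#     Name  BA OBP AllStar
#     --------------------
#     Name .233 .360 Yes
#     Name .150 .288 No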

### Aggregations <br>
<p>The raw data plus some transformations is generally only half the story. Your objective is to extract actual insights and actionable conclusions from the data, and that means reducing it from potentially billions of rows to some summary statistics via aggregation functions.</p>

##### groupby() <br>
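<p>The .groupby() method is in some ways a 'master' aggregation: it splits the rows into groups that share a value in one or more columns, so that an aggregation function can then be applied to each group.</p> 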

<p>Data tables will usually reserve one column as a primary key - that is, a column for which each row has a unique value. This is to facilitate access to the exact rows of a data table that a user wants to view. The other columns will often have repeated values, such as the age groups in the above examples. We can use these columns to explore the data using the Pandas API:</p>

In [136]:
# Default groupby: the grouping column becomes the index of the result
display(customers.groupby('age_group').count())

# With as_index=False the grouping column stays a regular column, and we can select the columns we want
display(customers.groupby('age_group', as_index = False).count()[['customer_id','age_group']])


age_group,customer_id,renewal_nbr,term_in_years,customer_tenure,ages,age_group_apply,loyalty_age_group
Adult,7,7,7,7,7,7,7
Senior,1,1,1,1,1,1,1
Young Adult,2,2,2,2,2,2,2


Unnamed: 0,customer_id,age_group
0,7,Adult
1,1,Senior
2,2,Young Adult


##### Type of groupby()

<p>The result is a new dataframe, the columns of which all contain the counts of the grouped field. Notice the type of a grouped dataframe:</p>

In [139]:
print(type(customers), ' -> Regular Data Frame')
print(type(customers.groupby('age_group')), ' -> Groupby Object')

<class 'pandas.core.frame.DataFrame'>  -> Regular Data Frame
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>  -> Groupby Object


<p>This is because simply grouping data doesn't quite make sense without an aggregation function like count() to pair with. In this case, we're counting occurrences of the grouped field, but that's not all we can do. We can take averages, standard deviations, mins, maxes and much more! Let's see how this works a bit more:</p>

##### mean()

In [141]:
# numeric_only=True skips the text columns, which cannot be averaged
display(customers.groupby('age_group').mean(numeric_only=True))
display(customers.groupby('age_group')[['customer_tenure', 'ages']].mean())


age_group,customer_id,renewal_nbr,term_in_years,customer_tenure,ages
Adult,1063.0,4.857143,0.642857,3.071429,47.428571
Senior,1092.0,4.0,0.5,2.0,69.0
Young Adult,1045.5,4.5,0.5,2.25,28.5




age_group,customer_tenure,ages
Adult,3.071429,47.428571
Senior,2.0,69.0
Young Adult,2.25,28.5
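
When different columns need different summaries, `agg()` with named aggregation can compute them in one pass. This is a sketch on a small hypothetical frame (`customers_demo` and the output column names are illustrative), not the lesson's `customers` data.

```python
import pandas as pd

customers_demo = pd.DataFrame({
    'age_group': ['Adult', 'Adult', 'Senior', 'Young Adult'],
    'ages': [56, 46, 69, 32],
    'customer_tenure': [3.5, 1.5, 2.0, 3.5],
})

# Named aggregation: new_column=(source_column, function)
summary = customers_demo.groupby('age_group').agg(
    n_customers=('ages', 'count'),
    mean_age=('ages', 'mean'),
    max_tenure=('customer_tenure', 'max'),
)
print(summary)
```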


##### groupby() w/Multiple Columns

<p>We end up with the average age of the groups in the last column, the average tenure in the tenure column, and so on and so forth. You can even split the groups more finely by passing a list of columns to group by:</p>

In [142]:
customers.groupby(['age_group', 'ages']).mean(numeric_only=True).sort_values('ages', ascending=False)


age_group,ages,customer_id,renewal_nbr,term_in_years,customer_tenure
Senior,69,1092.0,4.0,0.5,2.0
Adult,60,1060.0,7.0,0.5,3.5
Adult,56,1068.5,5.5,0.75,3.75
Adult,46,1014.0,3.0,0.5,1.5
Adult,40,1074.0,7.0,0.5,3.5
Adult,38,1082.0,5.0,1.0,5.0
Adult,36,1074.0,1.0,0.5,0.5
Young Adult,32,1071.0,7.0,0.5,3.5
Young Adult,25,1020.0,2.0,0.5,1.0


##### drop_duplicates()

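<p>Returns a new dataframe with duplicate rows removed. When a column name is passed, rows count as duplicates if they repeat a value in that column, and only the first occurrence is kept by default.</p>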

In [143]:
customers

Unnamed: 0,customer_id,renewal_nbr,term_in_years,customer_tenure,ages,age_group,age_group_apply,loyalty_age_group
0,1051,7,0.5,3.5,56,Adult,Adult,Loyal Adult
1,1092,4,0.5,2.0,69,Senior,Senior,New Senior
2,1014,3,0.5,1.5,46,Adult,Adult,New Adult
3,1071,7,0.5,3.5,32,Young Adult,Young Adult,Loyal Young Adult
4,1060,7,0.5,3.5,60,Adult,Adult,Loyal Adult
5,1020,2,0.5,1.0,25,Young Adult,Young Adult,New Young Adult
6,1082,5,1.0,5.0,38,Adult,Adult,Loyal Adult
7,1086,4,1.0,4.0,56,Adult,Adult,Loyal Adult
8,1074,1,0.5,0.5,36,Adult,Adult,New Adult
9,1074,7,0.5,3.5,40,Adult,Adult,Loyal Adult


In [144]:
customer_copy = customers.drop_duplicates('renewal_nbr').reset_index(drop=True)
customer_copy

Unnamed: 0,customer_id,renewal_nbr,term_in_years,customer_tenure,ages,age_group,age_group_apply,loyalty_age_group
0,1051,7,0.5,3.5,56,Adult,Adult,Loyal Adult
1,1092,4,0.5,2.0,69,Senior,Senior,New Senior
2,1014,3,0.5,1.5,46,Adult,Adult,New Adult
3,1020,2,0.5,1.0,25,Young Adult,Young Adult,New Young Adult
4,1082,5,1.0,5.0,38,Adult,Adult,Loyal Adult
5,1074,1,0.5,0.5,36,Adult,Adult,New Adult
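
Which of the repeated rows survives is controlled by the `keep` parameter. A short sketch on invented renewal data:

```python
import pandas as pd

renewals = pd.DataFrame({
    'customer_id': [1051, 1092, 1071, 1086],
    'renewal_nbr': [7, 4, 7, 4],
})

first_seen = renewals.drop_duplicates('renewal_nbr')               # keep='first' is the default
last_seen = renewals.drop_duplicates('renewal_nbr', keep='last')   # keep the final occurrence
no_repeats = renewals.drop_duplicates('renewal_nbr', keep=False)   # drop every repeated value

print(first_seen['customer_id'].tolist())
print(last_seen['customer_id'].tolist())
print(len(no_repeats))
```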


In [145]:
# Send customer data into a CSV file
customer_copy.to_csv('customers.csv')

<p>Thus the groupby operation allows you to rapidly make summary observations about the state of your entire dataset at flexible granularity. In one line above, we actually did something very complicated - that's the power of the dataframe. In fact, the process often consists of several iterative groupby operations, each revealing greater insight than the last - if you don't know where to start with a dataset, try a bunch of groupbys!</p>

### Homework Exercise #1 - Find the Total Number of Home Runs and RBIs for the Red Sox <br>
<p>Get the total number of home runs and RBIs.</p>

In [115]:
# step 1: Add a new column with the key 'Team' and all column values should be 'BOS'

import pandas as pd

sox_2017 = pd.read_csv(r"C:\Users\narim\Downloads\redsox_2017_hitting.txt")

sox_2018 = pd.read_csv(r"C:\Users\narim\Downloads\redsox_2018_hitting.txt")

data = pd.concat([sox_2017, sox_2018])
data['Team'] = 'BOS'


# step 2: Group by the 'Team' column and get total home runs and rbi's

Final = data.groupby('Team').agg({'HR': 'sum', 'RBI': 'sum'}).reset_index()

print(Final.to_string(index=False))

# Produce data for both 2017 and 2018 (i.e. print both separated by a newline character \n)

# Expected output format:
# TEAM    HR   RBI
# ----------------
# BOS     144  538


Team  HR  RBI
 BOS 376 1561