### I. Let's create our first DataFrame

- Think of a `DataFrame` as a table with a rich set of functionality for data analysis and manipulation
- It is the main data structure of the `pandas` library
- It has rows and columns, each identified by labels (indices)

In [1]:
# Bring in the library
import pandas as pd

In [2]:
# Let's instantiate a dataframe with dummy data
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9],[10,11,12]], columns=["A","B","C"])

In [3]:
# It understands dictionaries too: each key becomes a column name
# (keys must be unique; a repeated key silently overwrites the earlier one)
data = {
    'A': [1, 4, 7],
    'B': [2, 5, 8],
    'C': [10, 11, 12]
}

df = pd.DataFrame(data)

In [4]:
# Ending a code cell with a `DataFrame` will print a formatted preview of its contents
df

Unnamed: 0,A,B,C
0,1,2,10
1,4,5,11
2,7,8,12


### 0, 1, 2...?

- These are called *__indices__*
- They identify rows; pandas assigns a default integer index (0, 1, 2, ...) when we don't provide one
- Think of them as row labels (and yep, they are customizable)

In [7]:
# Customize indices to your liking
pd.DataFrame([[1,2,3],[4,5,6],[7,8,9],[10,11,12]], columns=["A","B","C"], index=["w","x","y","z"])

Unnamed: 0,A,B,C
w,1,2,3
x,4,5,6
y,7,8,9
z,10,11,12
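Once rows have custom labels, those labels can be used for lookups, moved back into a regular column, or replaced by an existing column. The cell below is a minimal sketch using the same dummy data as above; the variable names `row_x`, `flat`, and `by_a` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
    columns=["A", "B", "C"],
    index=["w", "x", "y", "z"],
)

# Look up a row by its label
row_x = df.loc["x"]

# reset_index() moves the labels into a regular column and restores 0, 1, 2, ...
flat = df.reset_index()

# set_index() promotes an existing column to be the row labels
by_a = df.set_index("A")
```

All three calls return new objects, so `df` keeps its `w` to `z` labels.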


### II. Describing DataFrames

Let's look at common attributes and methods for examining our DataFrames

In [None]:
# To show the dimensions of the dataframe
df.shape

(3, 3)

In [21]:
# To show the # of elements
df.size

9
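As a quick sanity check, `shape`, `size`, and `len()` are related in a simple way. This sketch rebuilds the same small `df` so it can run on its own.

```python
import pandas as pd

# Same values as the df built earlier in this section
df = pd.DataFrame({"A": [1, 4, 7], "B": [2, 5, 8], "C": [10, 11, 12]})

n_rows, n_cols = df.shape      # shape is a (rows, columns) tuple
n_cells = df.size              # size counts every cell: rows * columns
n_rows_again = len(df)         # len() counts rows only
```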

In [22]:
# To see the first or last N rows
df.head() # or df.tail()

Unnamed: 0,A,B,C
0,1,2,10
1,4,5,11
2,7,8,12


In [26]:
# To sample the dataset for previewing
df.sample(3)

Unnamed: 0,A,B,C
2,7,8,12
0,1,2,10
1,4,5,11


In [23]:
# To show a summary of the composition of the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       3 non-null      int64
 1   B       3 non-null      int64
 2   C       3 non-null      int64
dtypes: int64(3)
memory usage: 200.0 bytes


In [24]:
# To show some basic statistics
df.describe()

Unnamed: 0,A,B,C
count,3.0,3.0,3.0
mean,4.0,5.0,11.0
std,3.0,3.0,1.0
min,1.0,2.0,10.0
25%,2.5,3.5,10.5
50%,4.0,5.0,11.0
75%,5.5,6.5,11.5
max,7.0,8.0,12.0
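Each row of the `describe()` table can also be computed on its own, which helps when only one statistic is needed. A small sketch, again rebuilding the same `df`; note that `std()` is the sample standard deviation.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 4, 7], "B": [2, 5, 8], "C": [10, 11, 12]})

summary = df.describe()

mean_a = df["A"].mean()          # matches the "mean" row
std_a = df["A"].std()            # sample standard deviation (divides by n - 1)
q25_a = df["A"].quantile(0.25)   # matches the "25%" row
```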


In [26]:
# To show the number of unique values in each column
df.nunique()

# To list the unique values of a specific column
# df['A'].unique()

A    3
B    3
C    3
dtype: int64

In [30]:
# To get the index labels as a list
df.index.tolist()

[0, 1, 2]

### III. From Files to DataFrames

- A more common real-world use case, however, is loading data from files
- pandas supports many file formats, each loaded through a `pd.read_*` function
- Let's try loading the Olympics Results dataset from several file formats

In [8]:
# Read CSV file
results = pd.read_csv('./data/olympics_results.csv')
results.head()

Unnamed: 0,year,type,discipline,event,as,athlete_id,noc,team,place,tied,medal
0,1912.0,Summer,Tennis,"Singles, Men (Olympic)",Jean-François Blanchy,1,FRA,,17.0,True,
1,1912.0,Summer,Tennis,"Doubles, Men (Olympic)",Jean-François Blanchy,1,FRA,Jean Montariol,,False,
2,1920.0,Summer,Tennis,"Singles, Men (Olympic)",Jean-François Blanchy,1,FRA,,32.0,True,
3,1920.0,Summer,Tennis,"Doubles, Mixed (Olympic)",Jean-François Blanchy,1,FRA,Jeanne Vaussard,8.0,True,
4,1920.0,Summer,Tennis,"Doubles, Men (Olympic)",Jean-François Blanchy,1,FRA,Jacques Brugnon,4.0,False,


In [None]:
# Read Excel file
results = pd.read_excel('./data/olympics_data.xlsx')
results.head()

Unnamed: 0,athlete_id,name,born_date,born_city,born_region,born_country,NOC,height_cm,weight_kg,died_date
0,1,Jean-François Blanchy,1886-12-12,Bordeaux,Gironde,FRA,France,,,1960-10-02
1,2,Arnaud Boetsch,1969-04-01,Meulan,Yvelines,FRA,France,183.0,76.0,
2,3,Jean Borotra,1898-08-13,Biarritz,Pyrénées-Atlantiques,FRA,France,183.0,76.0,1994-07-17
3,4,Jacques Brugnon,1895-05-11,Paris VIIIe,Paris,FRA,France,168.0,64.0,1978-03-20
4,5,Albert Canet,1878-04-17,Wandsworth,England,GBR,France,,,1930-07-25


In [10]:
# It loaded the default sheet, but we can specify a specific sheet
results = pd.read_excel('./data/olympics_data.xlsx', sheet_name="results")
results.head()

Unnamed: 0,year,type,discipline,event,as,athlete_id,noc,team,place,tied,medal
0,1912.0,Summer,Tennis,"Singles, Men (Olympic)",Jean-François Blanchy,1,FRA,,17.0,True,
1,1912.0,Summer,Tennis,"Doubles, Men (Olympic)",Jean-François Blanchy,1,FRA,Jean Montariol,,False,
2,1920.0,Summer,Tennis,"Singles, Men (Olympic)",Jean-François Blanchy,1,FRA,,32.0,True,
3,1920.0,Summer,Tennis,"Doubles, Mixed (Olympic)",Jean-François Blanchy,1,FRA,Jeanne Vaussard,8.0,True,
4,1920.0,Summer,Tennis,"Doubles, Men (Olympic)",Jean-François Blanchy,1,FRA,Jacques Brugnon,4.0,False,


In [11]:
# Read feather file
results = pd.read_feather('./data/olympics_results.feather')
results.head()

Unnamed: 0,year,type,discipline,event,as,athlete_id,noc,team,place,tied,medal
0,1912.0,Summer,Tennis,"Singles, Men (Olympic)",Jean-François Blanchy,1,FRA,,17.0,True,
1,1912.0,Summer,Tennis,"Doubles, Men (Olympic)",Jean-François Blanchy,1,FRA,Jean Montariol,,False,
2,1920.0,Summer,Tennis,"Singles, Men (Olympic)",Jean-François Blanchy,1,FRA,,32.0,True,
3,1920.0,Summer,Tennis,"Doubles, Mixed (Olympic)",Jean-François Blanchy,1,FRA,Jeanne Vaussard,8.0,True,
4,1920.0,Summer,Tennis,"Doubles, Men (Olympic)",Jean-François Blanchy,1,FRA,Jacques Brugnon,4.0,False,


In [12]:
# Read parquet file
results = pd.read_parquet('./data/olympics_results.parquet')
results.head()

Unnamed: 0,year,type,discipline,event,as,athlete_id,noc,team,place,tied,medal
0,1912.0,Summer,Tennis,"Singles, Men (Olympic)",Jean-François Blanchy,1,FRA,,17.0,True,
1,1912.0,Summer,Tennis,"Doubles, Men (Olympic)",Jean-François Blanchy,1,FRA,Jean Montariol,,False,
2,1920.0,Summer,Tennis,"Singles, Men (Olympic)",Jean-François Blanchy,1,FRA,,32.0,True,
3,1920.0,Summer,Tennis,"Doubles, Mixed (Olympic)",Jean-François Blanchy,1,FRA,Jeanne Vaussard,8.0,True,
4,1920.0,Summer,Tennis,"Doubles, Men (Olympic)",Jean-François Blanchy,1,FRA,Jacques Brugnon,4.0,False,


#### Knowledge check 💡

_Notice any differences when dealing with various file formats using the same exact dataset?_
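One difference worth looking for is how column types survive a round trip. The sketch below uses a tiny made-up frame and in-memory buffers instead of the Olympics files. It also swaps in pickle as the binary format, since Feather and Parquet need an extra engine installed; treat it as an illustration of the idea, not of those specific formats.

```python
import io
import pandas as pd

# Tiny stand-in dataset (not the Olympics files)
df = pd.DataFrame({
    "athlete_id": [1, 2],
    "born_date": pd.to_datetime(["1886-12-12", "1969-04-01"]),
})

# CSV is plain text, so the datetime column comes back as strings
csv_text = df.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

# Asking read_csv to parse the column restores the datetime dtype
from_csv_parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["born_date"])

# A binary format (pickle here) stores the dtypes along with the data
buffer = io.BytesIO()
df.to_pickle(buffer)
buffer.seek(0)
from_pickle = pd.read_pickle(buffer)
```

Comparing `from_csv.dtypes` with `from_pickle.dtypes` shows the difference directly.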

### IV. From DataFrames to Files

- We can also export data back to files with pandas
- This is useful for saving processed or cleaned data
- Exporting is done through the `DataFrame.to_*` methods

In [15]:
# Export dataframe into JSON file
results.to_json('./results.json')
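When no path is given, the `to_*` writers return the serialized text, which is handy for a quick look before writing to disk. A minimal sketch with a made-up two-row frame standing in for `results`:

```python
import io
import pandas as pd

# Made-up stand-in for the results DataFrame
results = pd.DataFrame({"athlete_id": [1, 2], "medal": ["Gold", None]})

# With no path, the writers return the serialized text
as_json = results.to_json(orient="records")
as_csv = results.to_csv(index=False)   # index=False leaves out the row labels

# Read the JSON text back into a DataFrame
restored = pd.read_json(io.StringIO(as_json), orient="records")
```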

### V. Accessing Data

- Selecting elements from a DataFrame can commonly be done using Python's slice notation ("`:`")
- Use `.iloc` for integer-based indexing and `.loc` for label-based indexing
- Use `.iat` or `.at` for fast scalar value access (single cell)
- Use these native indexer methods to efficiently select elements in dataframes whenever you can!
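Here is a compact side-by-side of the four indexers on a three-row frame with the same columns as the coffee dataset used in this chapter. The values are a small sample, so treat this as a sketch.

```python
import pandas as pd

coffee = pd.DataFrame({
    "Day": ["Monday", "Monday", "Tuesday"],
    "Coffee Type": ["Espresso", "Latte", "Espresso"],
    "Units Sold": [25, 15, 30],
})

rows_by_position = coffee.iloc[[0, 2]]                     # rows at positions 0 and 2
rows_by_label = coffee.loc[[0, 2], ["Day", "Units Sold"]]  # rows and columns by label

cell_by_position = coffee.iat[1, 2]         # row position 1, column position 2
cell_by_label = coffee.at[1, "Units Sold"]  # row label 1, column label "Units Sold"
```

With the default index, row positions and row labels happen to coincide; they diverge as soon as the index is customized.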

In [None]:
coffee = pd.read_csv('./coffee.csv')
coffee

Unnamed: 0,Day,Coffee Type,Units Sold
0,Monday,Espresso,25
1,Monday,Latte,15
2,Tuesday,Espresso,30
3,Tuesday,Latte,20
4,Wednesday,Espresso,35
5,Wednesday,Latte,25
6,Thursday,Espresso,40
7,Thursday,Latte,30
8,Friday,Espresso,45
9,Friday,Latte,35


In [71]:
# To select specific rows by integer position
coffee.iloc[[0,1,5]]

Unnamed: 0,Day,Coffee Type,Units Sold
0,Monday,Espresso,25
1,Monday,Latte,15
5,Wednesday,Latte,25


In [39]:
# To select rows from position 10 onwards
coffee.iloc[10:]

Unnamed: 0,Day,Coffee Type,Units Sold
10,Saturday,Espresso,45
11,Saturday,Latte,35
12,Sunday,Espresso,45
13,Sunday,Latte,35


In [41]:
# To select rows using start and stop positions
coffee.iloc[10:12]

Unnamed: 0,Day,Coffee Type,Units Sold
10,Saturday,Espresso,45
11,Saturday,Latte,35


#### Knowledge check 💡

_Pandas is actually following Python's zero-based indexing when slicing and it uses an exclusive upper bound. Why do you think this is done this way?_
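A short experiment can help with this question. It sketches two properties of the exclusive stop on a made-up Series, and contrasts them with `.loc`, where label slices include the stop label.

```python
import pandas as pd

units = pd.Series([25, 15, 30, 20, 35], index=["a", "b", "c", "d", "e"])

values = units.tolist()
head, tail = values[:2], values[2:]   # splitting at 2 gives two pieces with no overlap

by_position = units.iloc[1:3]         # positions 1 and 2; position 3 is excluded
by_label = units.loc["b":"d"]         # labels b, c and d; the stop label is included
```

The length of a positional slice is simply `stop - start`, and adjacent slices fit together without repeating an element.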


In [43]:
# To select all rows
coffee.iloc[:]

Unnamed: 0,Day,Coffee Type,Units Sold
0,Monday,Espresso,25
1,Monday,Latte,15
2,Tuesday,Espresso,30
3,Tuesday,Latte,20
4,Wednesday,Espresso,35
5,Wednesday,Latte,25
6,Thursday,Espresso,40
7,Thursday,Latte,30
8,Friday,Espresso,45
9,Friday,Latte,35


In [46]:
# To select all rows with specific columns
coffee.iloc[:, ['Day', 'Coffee Type']]

IndexError: .iloc requires numeric indexers, got ['Day' 'Coffee Type']

In [50]:
# .loc to the rescue
coffee.loc[:, ['Day', 'Coffee Type']]

Unnamed: 0,Day,Coffee Type
0,Monday,Espresso
1,Monday,Latte
2,Tuesday,Espresso
3,Tuesday,Latte
4,Wednesday,Espresso
5,Wednesday,Latte
6,Thursday,Espresso
7,Thursday,Latte
8,Friday,Espresso
9,Friday,Latte
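`.iloc` can still select columns, as long as they are given as integer positions. The sketch below uses a small sample of the coffee data and shows one way to translate column names into positions with `Index.get_indexer`.

```python
import pandas as pd

coffee = pd.DataFrame({
    "Day": ["Monday", "Monday", "Tuesday"],
    "Coffee Type": ["Espresso", "Latte", "Espresso"],
    "Units Sold": [25, 15, 30],
})

# .iloc accepts integer positions for columns as well as rows
by_position = coffee.iloc[:, [0, 1]]

# If only the names are known, translate them to positions first
positions = coffee.columns.get_indexer(["Day", "Coffee Type"])
by_lookup = coffee.iloc[:, positions]
```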


In [52]:
# Using index labels instead of default integers
coffee.index = coffee["Day"]
coffee.head()

Day,Day,Coffee Type,Units Sold
Monday,Monday,Espresso,25
Monday,Monday,Latte,15
Tuesday,Tuesday,Espresso,30
Tuesday,Tuesday,Latte,20
Wednesday,Wednesday,Espresso,35


In [60]:
# .loc works with labels
coffee.loc["Monday":"Wednesday", "Units Sold"]

Day
Monday       25
Monday       15
Tuesday      30
Tuesday      20
Wednesday    35
Wednesday    25
Name: Units Sold, dtype: int64

In [67]:
coffee = pd.read_csv('./coffee.csv') # reset df

# Use .iat or .at for selecting specific cells
coffee.iat[0,0]
coffee.at[0,"Day"]

'Monday'

In [68]:
# To select a single column (all rows) with attribute access
coffee.Day

0        Monday
1        Monday
2       Tuesday
3       Tuesday
4     Wednesday
5     Wednesday
6      Thursday
7      Thursday
8        Friday
9        Friday
10     Saturday
11     Saturday
12       Sunday
13       Sunday
Name: Day, dtype: object

In [69]:
# But square brackets are more robust because they work for both single-word and multi-word column names
coffee['Units Sold']

0     25
1     15
2     30
3     20
4     35
5     25
6     40
7     30
8     45
9     35
10    45
11    35
12    45
13    35
Name: Units Sold, dtype: int64

In [70]:
# To sort by multiple columns: Units Sold descending, then Coffee Type ascending
coffee.sort_values(["Units Sold", "Coffee Type"], ascending=[False, True])

Unnamed: 0,Day,Coffee Type,Units Sold
8,Friday,Espresso,45
10,Saturday,Espresso,45
12,Sunday,Espresso,45
6,Thursday,Espresso,40
4,Wednesday,Espresso,35
9,Friday,Latte,35
11,Saturday,Latte,35
13,Sunday,Latte,35
2,Tuesday,Espresso,30
7,Thursday,Latte,30


In [73]:
# You can iterate over the rows manually, but USE SPARINGLY
# Row-by-row loops give up the memory and performance benefits of pandas' built-in methods
for index, row in coffee.iterrows():
    print(index)
    print(row)
    # print(row["Coffee Type"])
    print("---")

0
Day              Monday
Coffee Type    Espresso
Units Sold           25
Name: 0, dtype: object
---
1
Day            Monday
Coffee Type     Latte
Units Sold         15
Name: 1, dtype: object
---
2
Day             Tuesday
Coffee Type    Espresso
Units Sold           30
Name: 2, dtype: object
---
3
Day            Tuesday
Coffee Type      Latte
Units Sold          20
Name: 3, dtype: object
---
4
Day            Wednesday
Coffee Type     Espresso
Units Sold            35
Name: 4, dtype: object
---
5
Day            Wednesday
Coffee Type        Latte
Units Sold            25
Name: 5, dtype: object
---
6
Day            Thursday
Coffee Type    Espresso
Units Sold           40
Name: 6, dtype: object
---
7
Day            Thursday
Coffee Type       Latte
Units Sold           30
Name: 7, dtype: object
---
8
Day              Friday
Coffee Type    Espresso
Units Sold           45
Name: 8, dtype: object
---
9
Day            Friday
Coffee Type     Latte
Units Sold         35
Name: 9, dtype: object
---
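To make the warning concrete, here is the same calculation done with a row loop and with a vectorized expression. This is a sketch on a three-row sample; the `price` column is an assumption added for the example.

```python
import pandas as pd

coffee = pd.DataFrame({
    "Coffee Type": ["Espresso", "Latte", "Espresso"],
    "Units Sold": [25, 15, 30],
    "price": [3.99, 5.99, 3.99],   # hypothetical prices for illustration
})

# Row by row: every iteration builds a Series and runs Python-level code
loop_total = 0.0
for _, row in coffee.iterrows():
    loop_total += row["Units Sold"] * row["price"]

# Vectorized: one expression over whole columns
vector_total = (coffee["Units Sold"] * coffee["price"]).sum()
```

Both give the same total; on large frames the vectorized form is typically much faster.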

### VI. Filtering Data

- Accessing data requires precise selection whereas filtering allows us to describe certain __criteria__ to match elements
- This is mostly done using __boolean indexing__ and __comparison__ operators
- For this chapter, we'll use the __Olympics Biography__ dataset to explore filtering techniques
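Boolean indexing has two steps: a comparison produces one True/False value per row, and that mask then selects the matching rows. A minimal sketch with placeholder names and made-up heights:

```python
import pandas as pd

# Placeholder rows, not taken from the Olympics dataset
bios = pd.DataFrame({
    "name": ["Athlete A", "Athlete B", "Athlete C"],
    "height_cm": [220.0, 183.0, None],
})

mask = bios["height_cm"] > 215   # boolean Series; missing heights compare as False
tall = bios[mask]                # keeps only the rows where the mask is True
```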

In [None]:
# Here's our dataset
bios = pd.read_csv('./data/olympics_bios.csv')
bios.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145500 entries, 0 to 145499
Data columns (total 10 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   athlete_id    145500 non-null  int64  
 1   name          145500 non-null  object 
 2   born_date     143693 non-null  object 
 3   born_city     110908 non-null  object 
 4   born_region   110908 non-null  object 
 5   born_country  110908 non-null  object 
 6   NOC           145499 non-null  object 
 7   height_cm     106651 non-null  float64
 8   weight_kg     102070 non-null  float64
 9   died_date     33940 non-null   object 
dtypes: float64(2), int64(1), object(7)
memory usage: 11.1+ MB


In [None]:
# To select specific columns for the rows that match a height condition
df = bios.loc[bios['height_cm'] > 215, ['name', 'height_cm']]
df.head()

Unnamed: 0,name,height_cm
5089,Viktor Pankrashkin,220.0
5583,Paulinho Villas Boas,217.0
5673,Gunther Behnke,221.0
5716,Uwe Blab,218.0
5781,Tommy Burleson,223.0


In [None]:
# Short-hand notation
bios[bios['height_cm'] > 215][['name', 'height_cm']]

In [None]:
# To filter using multiple conditions
bios[(bios['height_cm'] > 215)  & (bios['born_country'] == 'USA')]

Unnamed: 0,athlete_id,name,born_date,born_city,born_region,born_country,NOC,height_cm,weight_kg,died_date
5781,5804,Tommy Burleson,1952-02-24,Crossnore,North Carolina,USA,United States,223.0,102.0,
6722,6755,Shaquille O'Neal,1972-03-06,Newark,New Jersey,USA,United States,216.0,137.0,
6937,6972,David Robinson,1965-08-06,Key West,Florida,USA,United States,216.0,107.0,
123850,126093,Tyson Chandler,1982-10-02,Hanford,California,USA,United States,216.0,107.0,
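Conditions combine with `&` (and), `|` (or), and `~` (not), with each comparison wrapped in parentheses or stored in a variable first. The sketch below uses placeholder rows and also shows the equivalent `query()` string.

```python
import pandas as pd

# Placeholder rows for illustration
bios = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "height_cm": [223.0, 216.0, 180.0, 218.0],
    "born_country": ["USA", "USA", "USA", "GER"],
})

tall = bios["height_cm"] > 215
usa = bios["born_country"] == "USA"

tall_and_usa = bios[tall & usa]    # both conditions hold
tall_or_usa = bios[tall | usa]     # at least one holds
tall_not_usa = bios[tall & ~usa]   # ~ negates a mask

# The same "and" filter written as a query string
same_as_and = bios.query("height_cm > 215 and born_country == 'USA'")
```

Python's `and`, `or`, and `not` keywords do not work on Series, which is why the symbol operators are used outside of `query()`.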


In [80]:
# To filter using string operations
bios[bios['name'].str.contains("maron", case=False)]

Unnamed: 0,athlete_id,name,born_date,born_city,born_region,born_country,NOC,height_cm,weight_kg,died_date
78145,78736,Thomas Maroney,1895-08-12,New York,New York,USA,United States,180.0,72.0,1971-01-01
92660,93404,Kenneth Maronie,1980-03-04,Massacre,Saint Paul,DMA,Dominica,185.0,68.0,
124461,126764,McKayla Maroney,1995-12-09,Aliso Viejo,California,USA,United States,160.0,46.0,


In [83]:
# To filter names with repeated letters using regex
repeated_letters = bios[bios['name'].str.contains(r'(.)\1', na=False)]
repeated_letters.head()


Unnamed: 0,athlete_id,name,born_date,born_city,born_region,born_country,NOC,height_cm,weight_kg,died_date
9,10,Guy de la Chapelle,1868-07-16,Farges-Allichamps,Cher,FRA,France,,,1923-08-27
13,14,Étienne Durand,,,,,France,,,
17,18,Jean-Philippe Gatien,1968-10-16,Alès,Gard,FRA,France,178.0,73.0,
20,21,Pierre Hirsch,1900-11-08,Paris Ie,Paris,FRA,France,,,1942-04-19
28,29,Albert Lippmann,,,,,France,,,


In [85]:
# To filter names with 'son' or 'sen' at the end using regex
son_sen_names = bios[bios['name'].str.contains(r'son$|sen$', case=False, na=False)]
son_sen_names.head()

Unnamed: 0,athlete_id,name,born_date,born_city,born_region,born_country,NOC,height_cm,weight_kg,died_date
37,38,Helen Aitchison,1881-12-06,Sunderland,England,GBR,Great Britain,,,1947-05-26
92,93,Anne Gibson,1968-10-26,Dumfries,Scotland,GBR,Great Britain,173.0,70.0,
107,108,Anders Nielsen,1967-02-24,Cape Town,Western Cape,RSA,Great Britain,173.0,70.0,2010-07-29
113,114,Julian Robertson,1969-10-09,Peterborough,England,GBR,Great Britain,182.0,73.0,
120,121,Chris Wilkinson,1970-01-05,Southampton,England,GBR,Great Britain,180.0,72.0,


In [88]:
# To filter athletes born in the 1990s using regex
born_90s = bios[bios['born_date'].str.contains(r'^199', na=False)]
born_90s.head()

Unnamed: 0,athlete_id,name,born_date,born_city,born_region,born_country,NOC,height_cm,weight_kg,died_date
102295,103238,Zhang Tianyi,1990-04-24,Dandong,Liaoning,CHN,People's Republic of China,172.0,58.0,
102364,103308,Anastasiya Prilepa,1990-03-15,Almaty,Almaty,KAZ,Kazakhstan,161.0,47.0,
102391,103338,Yvonne Yip,1990-10-22,Hong Kong,Hong Kong,HKG,"Hong Kong, China",157.0,47.0,
103143,104113,Sameera Al-Bitar,1990-02-21,Amman,Amman,JOR,Bahrain,168.0,64.0,
103151,104121,Tojohanitra Andriamanjatoarimanana,1990-10-31,,,,Madagascar,140.0,,
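The regex works here because the dates are strings that start with the year. An alternative sketch, on a few made-up dates, converts the column to datetimes and filters on the year as a number; both masks should agree.

```python
import pandas as pd

# Made-up dates, including a missing one
born_date = pd.Series(["1990-04-24", None, "1886-12-12", "1999-01-01"])

# String approach: na=False treats missing dates as non-matches
regex_mask = born_date.str.contains(r"^199", na=False)

# Datetime approach: extract the year and test a numeric range
years = pd.to_datetime(born_date).dt.year
range_mask = years.between(1990, 1999)
```

The numeric version generalizes more easily, for example to a range that crosses decades.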


In [None]:
# To filter by membership in a list, combined with a string-prefix condition
bios[bios['born_country'].isin(["USA", "FRA", "GBR"]) & (bios['name'].str.startswith("Keith"))]

Unnamed: 0,athlete_id,name,born_date,born_city,born_region,born_country,NOC,height_cm,weight_kg,died_date
3505,3517,Keith Wallace,1961-03-29,Preston,England,GBR,Great Britain,165.0,51.0,1999-12-31
12053,12118,Keith Hervey,1898-11-03,Fulham,England,GBR,Great Britain,,,1973-02-22
14577,14674,Keith Harrison,1933-03-28,Birmingham,England,GBR,Great Britain,,,
16166,16281,Keith Reynolds,1963-12-25,Solihull,England,GBR,Great Britain,173.0,68.0,
18734,18862,Keith Sinclair,1945-06-26,Sunderland,England,GBR,Great Britain,190.0,79.0,
29897,30123,Keith Langley,1961-06-03,Aldershot,England,GBR,Great Britain,173.0,70.0,
34011,34275,Keith Remfry,1947-11-17,Ealing,England,GBR,Great Britain,193.0,114.0,2015-09-16
46885,47234,Keith Collin,1937-01-18,Marylebone,England,GBR,Great Britain,168.0,63.0,1991-03-06
50929,51288,Keith Carter,1924-08-30,Akron,Ohio,USA,United States,,,2013-05-03
51185,51544,Keith Russell,1948-01-15,Mesa,Arizona,USA,United States,188.0,73.0,


### VII. Adding and Removing Columns


In [132]:
# To add a column with the same value in every row
coffee['price'] = 4.99

In [68]:
coffee.head()

Unnamed: 0,Day,Coffee Type,Units Sold,price
0,Monday,Espresso,25,4.99
1,Monday,Latte,15,4.99
2,Tuesday,Espresso,30,4.99
3,Tuesday,Latte,20,4.99
4,Wednesday,Espresso,35,4.99


In [133]:
import numpy as np

# To set a column conditionally: 3.99 where the type is Espresso, 5.99 otherwise
coffee['new_price'] = np.where(coffee['Coffee Type']=='Espresso', 3.99, 5.99)

In [70]:
coffee

Unnamed: 0,Day,Coffee Type,Units Sold,price,new_price
0,Monday,Espresso,25,4.99,3.99
1,Monday,Latte,15,4.99,5.99
2,Tuesday,Espresso,30,4.99,3.99
3,Tuesday,Latte,20,4.99,5.99
4,Wednesday,Espresso,35,4.99,3.99
5,Wednesday,Latte,25,4.99,5.99
6,Thursday,Espresso,40,4.99,3.99
7,Thursday,Latte,30,4.99,5.99
8,Friday,Espresso,45,4.99,3.99
9,Friday,Latte,35,4.99,5.99
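`np.where` handles a two-way choice. With more than two cases, `np.select` takes a list of conditions and a matching list of values. The sketch below assumes a third coffee type, "Mocha", and a fallback price of 4.99, neither of which is in the coffee dataset.

```python
import numpy as np
import pandas as pd

# "Mocha" is a hypothetical third type added for illustration
coffee = pd.DataFrame({"Coffee Type": ["Espresso", "Latte", "Mocha"]})

conditions = [
    coffee["Coffee Type"] == "Espresso",
    coffee["Coffee Type"] == "Latte",
]
choices = [3.99, 5.99]

# Rows matching none of the conditions get the default value
coffee["new_price"] = np.select(conditions, choices, default=4.99)
```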


In [134]:
# drop() returns a new DataFrame without the column; coffee itself is unchanged
coffee.drop(columns=['price'])

Unnamed: 0,Day,Coffee Type,Units Sold,new_price
0,Monday,Espresso,25,3.99
1,Monday,Latte,15,5.99
2,Tuesday,Espresso,30,3.99
3,Tuesday,Latte,20,5.99
4,Wednesday,Espresso,35,3.99
5,Wednesday,Latte,25,5.99
6,Thursday,Espresso,40,3.99
7,Thursday,Latte,30,5.99
8,Friday,Espresso,45,3.99
9,Friday,Latte,35,5.99


In [72]:
# The price column is still there
coffee.head()

Unnamed: 0,Day,Coffee Type,Units Sold,price,new_price
0,Monday,Espresso,25,4.99,3.99
1,Monday,Latte,15,4.99,5.99
2,Tuesday,Espresso,30,4.99,3.99
3,Tuesday,Latte,20,4.99,5.99
4,Wednesday,Espresso,35,4.99,3.99


In [135]:
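# Pass inplace=True (or reassign the result) to make the change stick
coffee.drop(columns=['price'], inplace=True)
coffee

# Note: assigning a DataFrame to a new variable does not copy it; both names point to the
# same object. Use df.copy() when you need an independent clone.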

Unnamed: 0,Day,Coffee Type,Units Sold,new_price
0,Monday,Espresso,25,3.99
1,Monday,Latte,15,5.99
2,Tuesday,Espresso,30,3.99
3,Tuesday,Latte,20,5.99
4,Wednesday,Espresso,35,3.99
5,Wednesday,Latte,25,5.99
6,Thursday,Espresso,40,3.99
7,Thursday,Latte,30,5.99
8,Friday,Espresso,45,3.99
9,Friday,Latte,35,5.99


In [75]:
# Another way to drop columns: select only the ones to keep
coffee = coffee[['Day', 'Coffee Type', 'Units Sold', 'new_price']]

In [136]:
coffee

Unnamed: 0,Day,Coffee Type,Units Sold,new_price
0,Monday,Espresso,25,3.99
1,Monday,Latte,15,5.99
2,Tuesday,Espresso,30,3.99
3,Tuesday,Latte,20,5.99
4,Wednesday,Espresso,35,3.99
5,Wednesday,Latte,25,5.99
6,Thursday,Espresso,40,3.99
7,Thursday,Latte,30,5.99
8,Friday,Espresso,45,3.99
9,Friday,Latte,35,5.99


In [137]:
# To derive a new column from existing ones
coffee['revenue'] = coffee['Units Sold'] * coffee['new_price']
coffee

Unnamed: 0,Day,Coffee Type,Units Sold,new_price,revenue
0,Monday,Espresso,25,3.99,99.75
1,Monday,Latte,15,5.99,89.85
2,Tuesday,Espresso,30,3.99,119.7
3,Tuesday,Latte,20,5.99,119.8
4,Wednesday,Espresso,35,3.99,139.65
5,Wednesday,Latte,25,5.99,149.75
6,Thursday,Espresso,40,3.99,159.6
7,Thursday,Latte,30,5.99,179.7
8,Friday,Espresso,45,3.99,179.55
9,Friday,Latte,35,5.99,209.65


In [138]:
# rename() also returns a new DataFrame, so this alone does not change coffee
coffee.rename(columns={'new_price': 'price'})

Unnamed: 0,Day,Coffee Type,Units Sold,price,revenue
0,Monday,Espresso,25,3.99,99.75
1,Monday,Latte,15,5.99,89.85
2,Tuesday,Espresso,30,3.99,119.7
3,Tuesday,Latte,20,5.99,119.8
4,Wednesday,Espresso,35,3.99,139.65
5,Wednesday,Latte,25,5.99,149.75
6,Thursday,Espresso,40,3.99,159.6
7,Thursday,Latte,30,5.99,179.7
8,Friday,Espresso,45,3.99,179.55
9,Friday,Latte,35,5.99,209.65


In [79]:
coffee

Unnamed: 0,Day,Coffee Type,Units Sold,new_price,revenue
0,Monday,Espresso,25,3.99,99.75
1,Monday,Latte,15,5.99,89.85
2,Tuesday,Espresso,30,3.99,119.7
3,Tuesday,Latte,20,5.99,119.8
4,Wednesday,Espresso,35,3.99,139.65
5,Wednesday,Latte,25,5.99,149.75
6,Thursday,Espresso,40,3.99,159.6
7,Thursday,Latte,30,5.99,179.7
8,Friday,Espresso,45,3.99,179.55
9,Friday,Latte,35,5.99,209.65


In [139]:
# Reassign the result (or use the inplace=True variant) to keep the new name
# coffee.rename(columns={'new_price': 'price'}, inplace=True)
coffee = coffee.rename(columns={'new_price': 'price'})

In [81]:
coffee

Unnamed: 0,Day,Coffee Type,Units Sold,price,revenue
0,Monday,Espresso,25,3.99,99.75
1,Monday,Latte,15,5.99,89.85
2,Tuesday,Espresso,30,3.99,119.7
3,Tuesday,Latte,20,5.99,119.8
4,Wednesday,Espresso,35,3.99,139.65
5,Wednesday,Latte,25,5.99,149.75
6,Thursday,Espresso,40,3.99,159.6
7,Thursday,Latte,30,5.99,179.7
8,Friday,Espresso,45,3.99,179.55
9,Friday,Latte,35,5.99,209.65


In [83]:
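# Work on a copy so the original bios stays untouched
bios_new = bios.copy()

In [84]:
# To extract the first name by splitting on spaces and taking the first piece
bios_new['first_name'] = bios_new['name'].str.split(' ').str[0]

In [86]:
# To filter with a query string instead of a boolean mask
bios_new.query('first_name == "Keith"')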

Unnamed: 0,athlete_id,name,born_date,born_city,born_region,born_country,NOC,height_cm,weight_kg,died_date,first_name
1897,1907,Keith Hanlon,1966-09-01,,,,Ireland,,,,Keith
3505,3517,Keith Wallace,1961-03-29,Preston,England,GBR,Great Britain,165.0,51.0,1999-12-31,Keith
6228,6255,Keith Hartley,1940-10-15,Vancouver,British Columbia,CAN,Canada,200.0,85.0,,Keith
8898,8946,Keith Mwila,1966-01-01,,,,Zambia,,,1993-01-09,Keith
12053,12118,Keith Hervey,1898-11-03,Fulham,England,GBR,Great Britain,,,1973-02-22,Keith
...,...,...,...,...,...,...,...,...,...,...,...
99921,100722,Keith Carney,1970-02-03,Providence,Rhode Island,USA,United States,188.0,93.0,,Keith
102227,103168,Keith Beavers,1983-02-09,London,Ontario,CAN,Canada,185.0,75.0,,Keith
109900,111105,Keith Cumberpatch,1927-08-25,Christchurch,Canterbury,NZL,New Zealand,,,2013-11-15,Keith
115973,117348,Keith Sanderson,1975-02-02,Plymouth,Massachusetts,USA,United States,183.0,95.0,,Keith


In [87]:
bios_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145500 entries, 0 to 145499
Data columns (total 11 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   athlete_id    145500 non-null  int64  
 1   name          145500 non-null  object 
 2   born_date     143693 non-null  object 
 3   born_city     110908 non-null  object 
 4   born_region   110908 non-null  object 
 5   born_country  110908 non-null  object 
 6   NOC           145499 non-null  object 
 7   height_cm     106651 non-null  float64
 8   weight_kg     102070 non-null  float64
 9   died_date     33940 non-null   object 
 10  first_name    145500 non-null  object 
dtypes: float64(2), int64(1), object(8)
memory usage: 12.2+ MB


In [88]:
# To convert the date strings into proper datetime values
bios_new['born_datetime'] = pd.to_datetime(bios_new['born_date'])

In [90]:
bios_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145500 entries, 0 to 145499
Data columns (total 12 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   athlete_id     145500 non-null  int64         
 1   name           145500 non-null  object        
 2   born_date      143693 non-null  object        
 3   born_city      110908 non-null  object        
 4   born_region    110908 non-null  object        
 5   born_country   110908 non-null  object        
 6   NOC            145499 non-null  object        
 7   height_cm      106651 non-null  float64       
 8   weight_kg      102070 non-null  float64       
 9   died_date      33940 non-null   object        
 10  first_name     145500 non-null  object        
 11  born_datetime  143693 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(2), int64(1), object(8)
memory usage: 13.3+ MB


In [91]:
bios_new['born_year'] = bios_new['born_datetime'].dt.year

In [92]:
bios_new[['name','born_year']]

Unnamed: 0,name,born_year
0,Jean-François Blanchy,1886.0
1,Arnaud Boetsch,1969.0
2,Jean Borotra,1898.0
3,Jacques Brugnon,1895.0
4,Albert Canet,1878.0
...,...,...
145495,Polina Luchnikova,2002.0
145496,Valeriya Merkusheva,1999.0
145497,Yuliya Smirnova,1998.0
145498,André Foussard,1899.0
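
`born_year` is displayed as a float (`1886.0`) because rows with no `born_date` become `NaN`, and a NumPy integer column cannot hold `NaN`. One possible way around this, sketched on a made-up three-value series rather than the real `bios_new` data, is to cast the year to pandas' nullable `Int64` dtype. The `errors="coerce"` argument is an extra assumption here, added so that malformed strings become `NaT` instead of raising.

```python
import pandas as pd

# Hypothetical sample: the second value is missing, the third is malformed
born_date = pd.Series(["1886-12-12", None, "not a date"])

# errors="coerce" turns anything unparseable into NaT instead of raising
born_datetime = pd.to_datetime(born_date, errors="coerce")

# Nullable Int64 keeps whole-number years while still allowing missing values
born_year = born_datetime.dt.year.astype("Int64")
print(born_year)
```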


In [93]:
bios_new.to_csv('./data/bios_new.csv', index=False)

In [94]:
# Note: NaN heights fail both `<` comparisons, so missing values end up labelled 'Tall'
bios['height_category'] = bios['height_cm'].apply(lambda x: 'Short' if x < 165 else ('Average' if x < 185 else 'Tall'))

In [95]:
bios.head()

Unnamed: 0,athlete_id,name,born_date,born_city,born_region,born_country,NOC,height_cm,weight_kg,died_date,height_category
0,1,Jean-François Blanchy,1886-12-12,Bordeaux,Gironde,FRA,France,,,1960-10-02,Tall
1,2,Arnaud Boetsch,1969-04-01,Meulan,Yvelines,FRA,France,183.0,76.0,,Average
2,3,Jean Borotra,1898-08-13,Biarritz,Pyrénées-Atlantiques,FRA,France,183.0,76.0,1994-07-17,Average
3,4,Jacques Brugnon,1895-05-11,Paris VIIIe,Paris,FRA,France,168.0,64.0,1978-03-20,Average
4,5,Albert Canet,1878-04-17,Wandsworth,England,GBR,France,,,1930-07-25,Tall
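
In the preview above, athletes with no recorded height are labelled `Tall`, because every comparison against `NaN` is false. A minimal sketch of an alternative, assuming the same thresholds and using a made-up `toy` frame in place of `bios`, is `pd.cut`, which leaves missing heights as missing:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `bios` frame (values are made up for illustration)
toy = pd.DataFrame({"height_cm": [160.0, 170.0, 190.0, np.nan]})

# right=False makes the bins [-inf, 165), [165, 185), [185, inf),
# which mirrors the `< 165` and `< 185` checks in the lambda
toy["height_category"] = pd.cut(
    toy["height_cm"],
    bins=[-np.inf, 165, 185, np.inf],
    labels=["Short", "Average", "Tall"],
    right=False,
)
print(toy)
```

The result is a categorical column, and the row with a `NaN` height stays `NaN` rather than being counted as `Tall`.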


In [97]:
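def categorize_athlete(row):
    # Comparisons against NaN are False, so athletes with a missing
    # height or weight fall through to 'Heavyweight'
    if row['height_cm'] < 175 and row['weight_kg'] < 70:
        return 'Lightweight'
    elif row['height_cm'] < 185 or row['weight_kg'] <= 80:
        return 'Middleweight'
    else:
        return 'Heavyweight'

# axis=1 passes each row to the function; axis=0 would pass each column
bios['Category'] = bios.apply(categorize_athlete, axis=1)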

In [98]:
bios.head()

Unnamed: 0,athlete_id,name,born_date,born_city,born_region,born_country,NOC,height_cm,weight_kg,died_date,height_category,Category
0,1,Jean-François Blanchy,1886-12-12,Bordeaux,Gironde,FRA,France,,,1960-10-02,Tall,Heavyweight
1,2,Arnaud Boetsch,1969-04-01,Meulan,Yvelines,FRA,France,183.0,76.0,,Average,Middleweight
2,3,Jean Borotra,1898-08-13,Biarritz,Pyrénées-Atlantiques,FRA,France,183.0,76.0,1994-07-17,Average,Middleweight
3,4,Jacques Brugnon,1895-05-11,Paris VIIIe,Paris,FRA,France,168.0,64.0,1978-03-20,Average,Lightweight
4,5,Albert Canet,1878-04-17,Wandsworth,England,GBR,France,,,1930-07-25,Tall,Heavyweight
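
Row-wise `apply` calls a Python function once per row, which can be slow on a frame of this size. A vectorized sketch of the same rules using `np.select` follows. The `toy` frame and the `"Unknown"` label for missing measurements are assumptions added for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `bios` (made-up values)
toy = pd.DataFrame({
    "height_cm": [168.0, 183.0, 200.0, np.nan],
    "weight_kg": [64.0, 76.0, 95.0, np.nan],
})

# Conditions are checked in order; the first one that is True wins
conditions = [
    toy["height_cm"].isna() | toy["weight_kg"].isna(),    # hypothetical extra rule
    (toy["height_cm"] < 175) & (toy["weight_kg"] < 70),
    (toy["height_cm"] < 185) | (toy["weight_kg"] <= 80),
]
choices = ["Unknown", "Lightweight", "Middleweight"]

toy["Category"] = np.select(conditions, choices, default="Heavyweight")
print(toy)
```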


### Merging & Concatenating

In [99]:
nocs = pd.read_csv('./data/noc_regions.csv')

In [100]:
nocs.head()

Unnamed: 0,NOC,region,notes
0,AFG,Afghanistan,
1,AHO,Curacao,Netherlands Antilles
2,ALB,Albania,
3,ALG,Algeria,
4,AND,Andorra,


In [101]:
bios_new = pd.merge(bios, nocs, left_on='born_country', right_on='NOC', how='left')

In [103]:
bios_new.rename(columns={'region': 'born_country_full'}, inplace=True)

In [104]:
bios_new.head()

Unnamed: 0,athlete_id,name,born_date,born_city,born_region,born_country,NOC_x,height_cm,weight_kg,died_date,height_category,Category,NOC_y,born_country_full,notes
0,1,Jean-François Blanchy,1886-12-12,Bordeaux,Gironde,FRA,France,,,1960-10-02,Tall,Heavyweight,FRA,France,
1,2,Arnaud Boetsch,1969-04-01,Meulan,Yvelines,FRA,France,183.0,76.0,,Average,Middleweight,FRA,France,
2,3,Jean Borotra,1898-08-13,Biarritz,Pyrénées-Atlantiques,FRA,France,183.0,76.0,1994-07-17,Average,Middleweight,FRA,France,
3,4,Jacques Brugnon,1895-05-11,Paris VIIIe,Paris,FRA,France,168.0,64.0,1978-03-20,Average,Lightweight,FRA,France,
4,5,Albert Canet,1878-04-17,Wandsworth,England,GBR,France,,,1930-07-25,Tall,Heavyweight,GBR,UK,
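
The `NOC_x` and `NOC_y` columns appear because both frames contain a column called `NOC`, and pandas adds default suffixes to tell them apart. A small sketch with made-up frames shows two `pd.merge` options that may make the result easier to read: `suffixes` for naming the clashing columns, and `indicator` for flagging rows that found no match. The suffix names `_team` and `_code` are my own choice.

```python
import pandas as pd

# Made-up stand-ins for `bios` and `nocs`
bios_toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "born_country": ["FRA", "GBR", None],
    "NOC": ["France", "France", "Jamaica"],
})
nocs_toy = pd.DataFrame({"NOC": ["FRA", "GBR"], "region": ["France", "UK"]})

merged = pd.merge(
    bios_toy, nocs_toy,
    left_on="born_country", right_on="NOC",
    how="left",
    suffixes=("_team", "_code"),  # replaces the default _x / _y
    indicator=True,               # adds a `_merge` column: both / left_only
)
print(merged)
```

Rows marked `left_only` in `_merge` are the ones whose `born_country` had no counterpart in the lookup table.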


In [106]:
bios_new[bios_new['NOC_x'] != bios_new['born_country_full']][['name','NOC_x','born_country_full']]

Unnamed: 0,name,NOC_x,born_country_full
4,Albert Canet,France,UK
12,J. Defert,France,
13,Étienne Durand,France,
16,Guy Forget,France,Morocco
27,"Guy, Baron Lejeune",France,
...,...,...,...
145491,Matthew Wepke,Jamaica,
145493,Landysh Falyakhova,ROC,Russia
145495,Polina Luchnikova,ROC,Russia
145496,Valeriya Merkusheva,ROC,Russia


In [112]:
usa = bios[bios['born_country'] == 'USA'].copy()
ph = bios[bios['born_country'] == 'PHI'].copy()

new_df = pd.concat([usa,ph])

In [117]:
new_df

Unnamed: 0,athlete_id,name,born_date,born_city,born_region,born_country,NOC,height_cm,weight_kg,died_date,height_category,Category
54,55,Monique Javer,1967-07-22,Burlingame,California,USA,Great Britain,177.0,64.0,,Average,Middleweight
960,964,Xóchitl Escobedo,1968-09-17,West Covina,California,USA,Mexico,170.0,60.0,,Average,Lightweight
961,965,Angélica Gavaldón,1973-10-03,El Centro,California,USA,Mexico,160.0,54.0,,Short,Lightweight
1231,1238,Bert Schneider,1897-07-01,Cleveland,Ohio,USA,Canada,,,1986-02-20,Tall,Heavyweight
1345,1352,Laura Berg,1975-01-06,Santa Fe Springs,California,USA,United States,168.0,61.0,,Average,Lightweight
...,...,...,...,...,...,...,...,...,...,...,...,...
142281,145883,Elreen Ando,1998-11-01,Cebu City,Cebu City,PHI,Philippines,,,,Tall,Heavyweight
142282,145884,Ernest Obiena,1995-11-17,Manila,Manila,PHI,Philippines,,,,Tall,Heavyweight
142283,145885,Carlo Paalam,1998-07-16,Cagayan de Oro,Misamis Oriental,PHI,Philippines,,,,Tall,Heavyweight
142284,145886,Eumir Marcial,1995-10-29,Zamboanga,Zamboanga del Sur,PHI,Philippines,,,,Tall,Heavyweight
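
Note that `pd.concat` keeps each frame's original row labels, which is why the index above jumps from the USA rows to the PHI rows. If a fresh 0..n-1 index is preferred, `ignore_index=True` is one option. The sketch below uses two one-row frames built by hand from names shown in the output above.

```python
import pandas as pd

# One-row stand-ins for the `usa` and `ph` frames
usa_toy = pd.DataFrame({"name": ["Laura Berg"], "born_country": ["USA"]}, index=[1345])
ph_toy = pd.DataFrame({"name": ["Ernest Obiena"], "born_country": ["PHI"]}, index=[142282])

stacked = pd.concat([usa_toy, ph_toy])                        # keeps 1345, 142282
renumbered = pd.concat([usa_toy, ph_toy], ignore_index=True)  # relabels rows 0, 1

print(stacked.index.tolist())
print(renumbered.index.tolist())
```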


In [118]:
combined_df = pd.merge(results, bios, on='athlete_id', how='left')

In [119]:
combined_df.head()

Unnamed: 0,year,type,discipline,event,as,athlete_id,noc,team,place,tied,...,born_date,born_city,born_region,born_country,NOC,height_cm,weight_kg,died_date,height_category,Category
0,1912.0,Summer,Tennis,"Singles, Men (Olympic)",Jean-François Blanchy,1,FRA,,17.0,True,...,1886-12-12,Bordeaux,Gironde,FRA,France,,,1960-10-02,Tall,Heavyweight
1,1912.0,Summer,Tennis,"Doubles, Men (Olympic)",Jean-François Blanchy,1,FRA,Jean Montariol,,False,...,1886-12-12,Bordeaux,Gironde,FRA,France,,,1960-10-02,Tall,Heavyweight
2,1920.0,Summer,Tennis,"Singles, Men (Olympic)",Jean-François Blanchy,1,FRA,,32.0,True,...,1886-12-12,Bordeaux,Gironde,FRA,France,,,1960-10-02,Tall,Heavyweight
3,1920.0,Summer,Tennis,"Doubles, Mixed (Olympic)",Jean-François Blanchy,1,FRA,Jeanne Vaussard,8.0,True,...,1886-12-12,Bordeaux,Gironde,FRA,France,,,1960-10-02,Tall,Heavyweight
4,1920.0,Summer,Tennis,"Doubles, Men (Olympic)",Jean-François Blanchy,1,FRA,Jacques Brugnon,4.0,False,...,1886-12-12,Bordeaux,Gironde,FRA,France,,,1960-10-02,Tall,Heavyweight
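
This merge relies on each `athlete_id` appearing only once in `bios`. If it appeared twice, the result rows would silently multiply. A hedged sketch of how `validate="many_to_one"` can guard against that, using tiny made-up frames:

```python
import pandas as pd

# Made-up stand-ins: several results per athlete, one bio per athlete
results_toy = pd.DataFrame({
    "athlete_id": [1, 1, 2],
    "event": ["Singles", "Doubles", "Singles"],
})
bios_toy = pd.DataFrame({
    "athlete_id": [1, 2],
    "name": ["Jean-François Blanchy", "Arnaud Boetsch"],
})

# Passes: keys on the right-hand side are unique
combined = pd.merge(results_toy, bios_toy, on="athlete_id",
                    how="left", validate="many_to_one")

# Introduce a duplicate bio row; the same merge now raises MergeError
bios_dupes = pd.concat([bios_toy, bios_toy.iloc[[0]]], ignore_index=True)
try:
    pd.merge(results_toy, bios_dupes, on="athlete_id",
             how="left", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(len(combined), duplicate_detected)
```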


### Handling Null values

In [141]:
import numpy as np

# NaN = Not a Number; blank out the first two rows to simulate missing data
coffee.loc[[0,1], 'Units Sold'] = np.nan

In [121]:
coffee.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Day          14 non-null     object 
 1   Coffee Type  14 non-null     object 
 2   Units Sold   12 non-null     float64
 3   price        14 non-null     float64
 4   revenue      14 non-null     float64
dtypes: float64(3), object(2)
memory usage: 688.0+ bytes


In [142]:
coffee.isna().sum()

Day            0
Coffee Type    0
Units Sold     2
price          0
revenue        0
dtype: int64

In [143]:
coffee.fillna(10000)

Unnamed: 0,Day,Coffee Type,Units Sold,price,revenue
0,Monday,Espresso,10000.0,3.99,99.75
1,Monday,Latte,10000.0,5.99,89.85
2,Tuesday,Espresso,30.0,3.99,119.7
3,Tuesday,Latte,20.0,5.99,119.8
4,Wednesday,Espresso,35.0,3.99,139.65
5,Wednesday,Latte,25.0,5.99,149.75
6,Thursday,Espresso,40.0,3.99,159.6
7,Thursday,Latte,30.0,5.99,179.7
8,Friday,Espresso,45.0,3.99,179.55
9,Friday,Latte,35.0,5.99,209.65


In [131]:
coffee

Unnamed: 0,Day,Coffee Type,Units Sold
0,Monday,Espresso,25
1,Monday,Latte,15
2,Tuesday,Espresso,30
3,Tuesday,Latte,20
4,Wednesday,Espresso,35
5,Wednesday,Latte,25
6,Thursday,Espresso,40
7,Thursday,Latte,30
8,Friday,Espresso,45
9,Friday,Latte,35


In [144]:
coffee.fillna(coffee['Units Sold'].mean())

Unnamed: 0,Day,Coffee Type,Units Sold,price,revenue
0,Monday,Espresso,35.0,3.99,99.75
1,Monday,Latte,35.0,5.99,89.85
2,Tuesday,Espresso,30.0,3.99,119.7
3,Tuesday,Latte,20.0,5.99,119.8
4,Wednesday,Espresso,35.0,3.99,139.65
5,Wednesday,Latte,25.0,5.99,149.75
6,Thursday,Espresso,40.0,3.99,159.6
7,Thursday,Latte,30.0,5.99,179.7
8,Friday,Espresso,45.0,3.99,179.55
9,Friday,Latte,35.0,5.99,209.65
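
Passing a single number to `DataFrame.fillna` fills missing values in every column, not only `Units Sold`. That is harmless here because no other column has gaps, but it may not be what is wanted in general. A small sketch on a made-up frame shows the per-column form, where a dict maps column names to fill values:

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in two different columns
coffee_toy = pd.DataFrame({
    "Units Sold": [np.nan, 30.0, 20.0],
    "price": [3.99, np.nan, 5.99],
})

# mean() skips NaN, so this is (30 + 20) / 2
mean_units = coffee_toy["Units Sold"].mean()

# Only "Units Sold" is filled; the gap in "price" is left alone
filled = coffee_toy.fillna({"Units Sold": mean_units})
print(filled)
```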


In [159]:
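# Blank out rows 2 and 3 to set up the interpolation demo.
# (The KeyError below comes from a later re-run, after rows 2 and 3 had already been dropped.)
coffee.loc[[2,3], 'Units Sold'] = np.nan
# coffee.loc[[0,1], 'Units Sold'] = 15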

KeyError: "None of [Index([2, 3], dtype='int64')] are in the [index]"

In [None]:
coffee['Units Sold'].interpolate() # watch what happens

0     15.000000
1     15.000000
2     21.666667
3     28.333333
4     35.000000
5     25.000000
6     40.000000
7     30.000000
8     45.000000
9     35.000000
10    45.000000
11    35.000000
12    45.000000
13    35.000000
Name: Units Sold, dtype: float64
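
By default `interpolate()` fills each gap linearly between the nearest known values on either side. That is where 21.67 and 28.33 come from: they sit one third and two thirds of the way from 15 to 35. A minimal sketch reproducing just those five values:

```python
import numpy as np
import pandas as pd

# First five "Units Sold" values from the output above, with rows 2 and 3 blanked
units = pd.Series([15.0, 15.0, np.nan, np.nan, 35.0])

# Default method is linear: equal steps between the surrounding known values
filled = units.interpolate()
print(filled)
```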

In [151]:
coffee['Units Sold'] = coffee['Units Sold'].interpolate()

In [None]:
coffee.dropna() # be careful! drops the entire row if any column in it is NaN

Unnamed: 0,Day,Coffee Type,Units Sold,price,revenue
0,Monday,Espresso,15.0,3.99,99.75
1,Monday,Latte,15.0,5.99,89.85
4,Wednesday,Espresso,35.0,3.99,139.65
5,Wednesday,Latte,25.0,5.99,149.75
6,Thursday,Espresso,40.0,3.99,159.6
7,Thursday,Latte,30.0,5.99,179.7
8,Friday,Espresso,45.0,3.99,179.55
9,Friday,Latte,35.0,5.99,209.65
10,Saturday,Espresso,45.0,3.99,179.55
11,Saturday,Latte,35.0,5.99,209.65


In [156]:
coffee.dropna(subset=['Units Sold'], inplace=True)

In [157]:
coffee

Unnamed: 0,Day,Coffee Type,Units Sold,price,revenue
0,Monday,Espresso,15.0,3.99,99.75
1,Monday,Latte,15.0,5.99,89.85
4,Wednesday,Espresso,35.0,3.99,139.65
5,Wednesday,Latte,25.0,5.99,149.75
6,Thursday,Espresso,40.0,3.99,159.6
7,Thursday,Latte,30.0,5.99,179.7
8,Friday,Espresso,45.0,3.99,179.55
9,Friday,Latte,35.0,5.99,209.65
10,Saturday,Espresso,45.0,3.99,179.55
11,Saturday,Latte,35.0,5.99,209.65


In [161]:
# Only the last expression in a cell is displayed, so the output below is the notna() filter
coffee[coffee['Units Sold'].isna()]
coffee[coffee['Units Sold'].notna()]

Unnamed: 0,Day,Coffee Type,Units Sold,price,revenue
0,Monday,Espresso,15.0,3.99,99.75
1,Monday,Latte,15.0,5.99,89.85
4,Wednesday,Espresso,35.0,3.99,139.65
5,Wednesday,Latte,25.0,5.99,149.75
6,Thursday,Espresso,40.0,3.99,159.6
7,Thursday,Latte,30.0,5.99,179.7
8,Friday,Espresso,45.0,3.99,179.55
9,Friday,Latte,35.0,5.99,209.65
10,Saturday,Espresso,45.0,3.99,179.55
11,Saturday,Latte,35.0,5.99,209.65


### Aggregating Data

In [162]:
bios.head()

Unnamed: 0,athlete_id,name,born_date,born_city,born_region,born_country,NOC,height_cm,weight_kg,died_date,height_category,Category
0,1,Jean-François Blanchy,1886-12-12,Bordeaux,Gironde,FRA,France,,,1960-10-02,Tall,Heavyweight
1,2,Arnaud Boetsch,1969-04-01,Meulan,Yvelines,FRA,France,183.0,76.0,,Average,Middleweight
2,3,Jean Borotra,1898-08-13,Biarritz,Pyrénées-Atlantiques,FRA,France,183.0,76.0,1994-07-17,Average,Middleweight
3,4,Jacques Brugnon,1895-05-11,Paris VIIIe,Paris,FRA,France,168.0,64.0,1978-03-20,Average,Lightweight
4,5,Albert Canet,1878-04-17,Wandsworth,England,GBR,France,,,1930-07-25,Tall,Heavyweight


In [None]:
bios['born_city'].value_counts() # Budapest leading the charge

born_city
Budapest           1378
Moskva (Moscow)     883
Oslo                708
Stockholm           629
Praha (Prague)      600
                   ... 
Kirovgrad             1
Pereiaslav            1
Podgornyy             1
Kudepsta              1
Furmanov              1
Name: count, Length: 22368, dtype: int64

In [164]:
bios[bios['born_country'] == 'USA']['born_region'].value_counts()

born_region
California              1634
New York                 990
Illinois                 585
Massachusetts            530
Pennsylvania             530
New Jersey               381
Texas                    368
Minnesota                365
Ohio                     328
Michigan                 319
Washington               240
Florida                  235
Wisconsin                209
Colorado                 207
Connecticut              156
Indiana                  150
Oregon                   132
Georgia                  129
Virginia                 121
Maryland                 117
District of Columbia     107
Iowa                     102
Hawaiʻi                   95
Kansas                    94
Oklahoma                  93
Louisiana                 92
Utah                      91
Missouri                  91
North Carolina            86
Arizona                   83
New Hampshire             83
Vermont                   68
Mississippi               66
Alabama                   64
...

In [169]:
# Only the last expression in a cell is displayed, so the output below is the mean
coffee.groupby(['Coffee Type'])['Units Sold'].sum()
coffee.groupby(['Coffee Type'])['Units Sold'].mean()

Coffee Type
Espresso    37.500000
Latte       29.166667
Name: Units Sold, dtype: float64

In [170]:
coffee.groupby(['Coffee Type']).agg({'Units Sold': 'sum', 'price': 'mean'})

Unnamed: 0_level_0,Units Sold,price
Coffee Type,Unnamed: 1_level_1,Unnamed: 2_level_1
Espresso,225.0,3.99
Latte,175.0,5.99


In [171]:
coffee.groupby(['Coffee Type', 'Day']).agg({'Units Sold': 'sum', 'price': 'mean'})

Unnamed: 0_level_0,Unnamed: 1_level_0,Units Sold,price
Coffee Type,Day,Unnamed: 2_level_1,Unnamed: 3_level_1
Espresso,Friday,45.0,3.99
Espresso,Monday,15.0,3.99
Espresso,Saturday,45.0,3.99
Espresso,Sunday,45.0,3.99
Espresso,Thursday,40.0,3.99
Espresso,Wednesday,35.0,3.99
Latte,Friday,35.0,5.99
Latte,Monday,15.0,5.99
Latte,Saturday,35.0,5.99
Latte,Sunday,35.0,5.99


In [172]:
coffee

Unnamed: 0,Day,Coffee Type,Units Sold,price,revenue
0,Monday,Espresso,15.0,3.99,99.75
1,Monday,Latte,15.0,5.99,89.85
4,Wednesday,Espresso,35.0,3.99,139.65
5,Wednesday,Latte,25.0,5.99,149.75
6,Thursday,Espresso,40.0,3.99,159.6
7,Thursday,Latte,30.0,5.99,179.7
8,Friday,Espresso,45.0,3.99,179.55
9,Friday,Latte,35.0,5.99,209.65
10,Saturday,Espresso,45.0,3.99,179.55
11,Saturday,Latte,35.0,5.99,209.65


In [173]:
pivoted = coffee.pivot(columns='Coffee Type', index='Day', values='revenue')

In [174]:
pivoted

Coffee Type,Espresso,Latte
Day,Unnamed: 1_level_1,Unnamed: 2_level_1
Friday,179.55,209.65
Monday,99.75,89.85
Saturday,179.55,209.65
Sunday,179.55,209.65
Thursday,159.6,179.7
Wednesday,139.65,149.75
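
`pivot` works here because each (Day, Coffee Type) pair occurs exactly once. If a pair were repeated, `pivot` would raise a `ValueError`. In that case `pivot_table` with an aggregation function is a common alternative. A sketch on made-up data with one deliberate duplicate:

```python
import pandas as pd

# Made-up sales with Monday/Espresso recorded twice
sales = pd.DataFrame({
    "Day": ["Monday", "Monday", "Monday", "Tuesday"],
    "Coffee Type": ["Espresso", "Espresso", "Latte", "Latte"],
    "revenue": [10.0, 20.0, 30.0, 40.0],
})

# pivot cannot decide which duplicate to keep, so it raises
try:
    sales.pivot(index="Day", columns="Coffee Type", values="revenue")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table combines duplicates with the given aggregation
table = sales.pivot_table(index="Day", columns="Coffee Type",
                          values="revenue", aggfunc="sum")
print(pivot_failed)
print(table)
```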


In [None]:
pivoted.loc['Monday', 'Latte']

np.float64(89.85000000000001)

In [176]:
pivoted.sum()

Coffee Type
Espresso     937.65
Latte       1048.25
dtype: float64

In [177]:
pivoted.sum(axis=1)

Day
Friday       389.2
Monday       189.6
Saturday     389.2
Sunday       389.2
Thursday     339.3
Wednesday    289.4
dtype: float64

In [178]:
bios.head()

Unnamed: 0,athlete_id,name,born_date,born_city,born_region,born_country,NOC,height_cm,weight_kg,died_date,height_category,Category
0,1,Jean-François Blanchy,1886-12-12,Bordeaux,Gironde,FRA,France,,,1960-10-02,Tall,Heavyweight
1,2,Arnaud Boetsch,1969-04-01,Meulan,Yvelines,FRA,France,183.0,76.0,,Average,Middleweight
2,3,Jean Borotra,1898-08-13,Biarritz,Pyrénées-Atlantiques,FRA,France,183.0,76.0,1994-07-17,Average,Middleweight
3,4,Jacques Brugnon,1895-05-11,Paris VIIIe,Paris,FRA,France,168.0,64.0,1978-03-20,Average,Lightweight
4,5,Albert Canet,1878-04-17,Wandsworth,England,GBR,France,,,1930-07-25,Tall,Heavyweight


In [181]:
bios['born_date'] = pd.to_datetime(bios['born_date'])
bios.groupby(bios['born_date'].dt.year)['name'].count()

born_date
1828.0      1
1831.0      2
1833.0      1
1836.0      1
1837.0      1
         ... 
2005.0    163
2006.0     17
2007.0      3
2008.0      3
2009.0      1
Name: name, Length: 177, dtype: int64

In [182]:
bios.groupby(bios['born_date'].dt.year)['name'].count().reset_index()

Unnamed: 0,born_date,name
0,1828.0,1
1,1831.0,2
2,1833.0,1
3,1836.0,1
4,1837.0,1
...,...,...
172,2005.0,163
173,2006.0,17
174,2007.0,3
175,2008.0,3


In [183]:
bios.groupby(bios['born_date'].dt.year)['name'].count().reset_index().sort_values('name', ascending=False)

Unnamed: 0,born_date,name
139,1972.0,2231
152,1985.0,2227
140,1973.0,2216
138,1971.0,2205
137,1970.0,2174
...,...,...
4,1837.0,1
2,1833.0,1
6,1839.0,1
12,1845.0,1


### Advanced Functionality

In [185]:
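# shift(1) moves values down one row; with two rows per day this is the previous row's revenue, not strictly yesterday's
coffee['yesterday_revenue'] = coffee['revenue'].shift(1)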

In [186]:
coffee

Unnamed: 0,Day,Coffee Type,Units Sold,price,revenue,yesterday_revenue
0,Monday,Espresso,15.0,3.99,99.75,
1,Monday,Latte,15.0,5.99,89.85,99.75
4,Wednesday,Espresso,35.0,3.99,139.65,89.85
5,Wednesday,Latte,25.0,5.99,149.75,139.65
6,Thursday,Espresso,40.0,3.99,159.6,149.75
7,Thursday,Latte,30.0,5.99,179.7,159.6
8,Friday,Espresso,45.0,3.99,179.55,179.7
9,Friday,Latte,35.0,5.99,209.65,179.55
10,Saturday,Espresso,45.0,3.99,179.55,209.65
11,Saturday,Latte,35.0,5.99,209.65,179.55


In [188]:
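# Despite the name, this is revenue as a percentage of the previous row (100 = unchanged), not the percent change
coffee['pct_change'] = coffee['revenue'] / coffee['yesterday_revenue'] * 100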

In [189]:
coffee

Unnamed: 0,Day,Coffee Type,Units Sold,price,revenue,yesterday_revenue,pct_change
0,Monday,Espresso,15.0,3.99,99.75,,
1,Monday,Latte,15.0,5.99,89.85,99.75,90.075188
4,Wednesday,Espresso,35.0,3.99,139.65,89.85,155.42571
5,Wednesday,Latte,25.0,5.99,149.75,139.65,107.232367
6,Thursday,Espresso,40.0,3.99,159.6,149.75,106.577629
7,Thursday,Latte,30.0,5.99,179.7,159.6,112.593985
8,Friday,Espresso,45.0,3.99,179.55,179.7,99.916528
9,Friday,Latte,35.0,5.99,209.65,179.55,116.764133
10,Saturday,Espresso,45.0,3.99,179.55,209.65,85.642738
11,Saturday,Latte,35.0,5.99,209.65,179.55,116.764133
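
The column above holds each row's revenue as a percentage of the previous row, so 90.08 means a drop of roughly 9.9%. If the signed change is what you are after, pandas has a built-in `Series.pct_change()`, which returns the fractional change from the previous row. A short sketch on made-up numbers, comparing the two:

```python
import pandas as pd

# Made-up revenue figures
revenue = pd.Series([100.0, 90.0, 135.0])

# Ratio form, as computed in the cell above: 100 means "unchanged"
ratio_pct = revenue / revenue.shift(1) * 100

# Built-in change form: 0 means "unchanged", negative means a drop
change_pct = revenue.pct_change() * 100

print(pd.DataFrame({"ratio_pct": ratio_pct, "change_pct": change_pct}))
```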


In [190]:
bios['height_rank'] = bios['height_cm'].rank(ascending=False)

In [191]:
bios.sort_values('height_rank')

Unnamed: 0,athlete_id,name,born_date,born_city,born_region,born_country,NOC,height_cm,weight_kg,died_date,height_category,Category,height_rank
89070,89782,Yao Ming,1980-09-12,Xuhui District,Shanghai,CHN,People's Republic of China,226.0,141.0,,Tall,Heavyweight,1.0
6978,7013,Arvydas Sabonis,1964-12-19,Kaunas,Kaunas,LTU,Lithuania Soviet Union,223.0,122.0,,Tall,Heavyweight,2.5
5781,5804,Tommy Burleson,1952-02-24,Crossnore,North Carolina,USA,United States,223.0,102.0,,Tall,Heavyweight,2.5
5673,5696,Gunther Behnke,1963-01-19,Leverkusen,Nordrhein-Westfalen,GER,Germany,221.0,114.0,,Tall,Heavyweight,5.0
89075,89787,Roberto Dueñas,1975-11-01,Madrid,Madrid,ESP,Spain,221.0,137.0,,Tall,Heavyweight,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
145490,149217,Sin Ye-Chan,1995-06-13,,,,Republic of Korea,,,,Tall,Heavyweight,
145491,149218,Matthew Wepke,1989-12-05,,,,Jamaica,,,,Tall,Heavyweight,
145492,149219,Carlos García-Ordóñez,1927-04-24,La Habana (Havana),Ciudad de La Habana,CUB,Cuba,,,2019-11-24,Tall,Heavyweight,
145493,149220,Landysh Falyakhova,1998-08-31,Dva Polya Artash,Respublika Tatarstan,RUS,ROC,,,,Tall,Heavyweight,
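
The fractional ranks (`2.5`, `5.0`) come from the default `method="average"`, which gives tied values the mean of the positions they occupy. Other tie-breaking methods exist. The sketch below compares three of them on four made-up heights that mimic the ties at the top of the table:

```python
import pandas as pd

# Made-up heights with one tie
heights = pd.Series([226.0, 223.0, 223.0, 221.0])

ranks = pd.DataFrame({
    "height_cm": heights,
    "average": heights.rank(ascending=False),                 # ties share the mean position
    "min": heights.rank(ascending=False, method="min"),       # ties share the lowest position
    "dense": heights.rank(ascending=False, method="dense"),   # like "min", but with no gaps
})
print(ranks)
```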


In [193]:
coffee['cumulative_revenue'] = coffee['revenue'].cumsum()

In [194]:
coffee

Unnamed: 0,Day,Coffee Type,Units Sold,price,revenue,yesterday_revenue,pct_change,cumulative_revenue
0,Monday,Espresso,15.0,3.99,99.75,,,99.75
1,Monday,Latte,15.0,5.99,89.85,99.75,90.075188,189.6
4,Wednesday,Espresso,35.0,3.99,139.65,89.85,155.42571,329.25
5,Wednesday,Latte,25.0,5.99,149.75,139.65,107.232367,479.0
6,Thursday,Espresso,40.0,3.99,159.6,149.75,106.577629,638.6
7,Thursday,Latte,30.0,5.99,179.7,159.6,112.593985,818.3
8,Friday,Espresso,45.0,3.99,179.55,179.7,99.916528,997.85
9,Friday,Latte,35.0,5.99,209.65,179.55,116.764133,1207.5
10,Saturday,Espresso,45.0,3.99,179.55,209.65,85.642738,1387.05
11,Saturday,Latte,35.0,5.99,209.65,179.55,116.764133,1596.7


In [195]:
latte = coffee[coffee['Coffee Type'] == "Latte"].copy()

In [197]:
latte['3day'] = latte['Units Sold'].rolling(3).sum()

In [198]:
latte

Unnamed: 0,Day,Coffee Type,Units Sold,price,revenue,yesterday_revenue,pct_change,cumulative_revenue,3day
1,Monday,Latte,15.0,5.99,89.85,99.75,90.075188,189.6,
5,Wednesday,Latte,25.0,5.99,149.75,139.65,107.232367,479.0,
7,Thursday,Latte,30.0,5.99,179.7,159.6,112.593985,818.3,70.0
9,Friday,Latte,35.0,5.99,209.65,179.55,116.764133,1207.5,90.0
11,Saturday,Latte,35.0,5.99,209.65,179.55,116.764133,1596.7,100.0
13,Sunday,Latte,35.0,5.99,209.65,179.55,116.764133,1985.9,105.0
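
The first two `3day` values are empty because `rolling(3)` only produces a result once a full window of three rows is available. If partial windows are acceptable, `min_periods` relaxes that requirement. A sketch on the first four latte values from the table above:

```python
import pandas as pd

# First four latte "Units Sold" values from the table above
units = pd.Series([15.0, 25.0, 30.0, 35.0])

strict = units.rolling(3).sum()                  # needs 3 rows, else NaN
lenient = units.rolling(3, min_periods=1).sum()  # sums whatever is available

print(pd.DataFrame({"units": units, "strict": strict, "lenient": lenient}))
```

Keep in mind that the window counts rows, not calendar days, so a missing day (Tuesday was dropped earlier) is simply skipped over.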


### New Functionality

In [199]:
pd.__version__

'2.2.3'

In [200]:
# Default: NumPy-backed dtypes
# pandas 2.x option: PyArrow-backed dtypes via dtype_backend='pyarrow'

results_numpy = pd.read_csv('./data/results.csv')
results_arrow = pd.read_csv('./data/results.csv', engine='pyarrow', dtype_backend='pyarrow')

In [202]:
results_numpy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 308408 entries, 0 to 308407
Data columns (total 11 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   year        305807 non-null  float64
 1   type        305807 non-null  object 
 2   discipline  308407 non-null  object 
 3   event       308408 non-null  object 
 4   as          308408 non-null  object 
 5   athlete_id  308408 non-null  int64  
 6   noc         308407 non-null  object 
 7   team        121714 non-null  object 
 8   place       283193 non-null  float64
 9   tied        308408 non-null  bool   
 10  medal       44139 non-null   object 
dtypes: bool(1), float64(2), int64(1), object(7)
memory usage: 23.8+ MB


In [None]:
# Main gain is string handling (NumPy was not designed for strings); the "+" in the NumPy figure means object columns are undercounted
results_arrow.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 308408 entries, 0 to 308407
Data columns (total 11 columns):
 #   Column      Non-Null Count   Dtype          
---  ------      --------------   -----          
 0   year        305807 non-null  double[pyarrow]
 1   type        305807 non-null  string[pyarrow]
 2   discipline  308407 non-null  string[pyarrow]
 3   event       308408 non-null  string[pyarrow]
 4   as          308408 non-null  string[pyarrow]
 5   athlete_id  308408 non-null  int64[pyarrow] 
 6   noc         308407 non-null  string[pyarrow]
 7   team        121714 non-null  string[pyarrow]
 8   place       283193 non-null  double[pyarrow]
 9   tied        308408 non-null  bool[pyarrow]  
 10  medal       44139 non-null   string[pyarrow]
dtypes: bool[pyarrow](1), double[pyarrow](2), int64[pyarrow](1), string[pyarrow](7)
memory usage: 37.5 MB
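
At first glance the PyArrow frame looks larger (37.5 MB versus 23.8+ MB), but the two numbers are not measured the same way. For `object` columns pandas counts only the 8-byte pointers by default, hence the `+`. The strings themselves are included only when a deep measurement is requested. A sketch on a made-up object-dtype series shows the gap. It does not need PyArrow installed, and on the real data `results_numpy.info(memory_usage='deep')` would give the comparable figure.

```python
import pandas as pd

# Made-up column of 3,000 Python strings stored as generic objects
teams = pd.Series(["Denmark", "United States", "Philippines"] * 1000, dtype=object)

# Shallow: counts only the pointers to the string objects
shallow = teams.memory_usage(index=False)

# Deep: also counts the memory held by each string object
deep = teams.memory_usage(index=False, deep=True)

print(shallow, deep)
```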


In [None]:
# AI-driven world: Copilot chat

In [13]:
import pandas as pd

results_df = pd.read_csv('./data/results.csv')

results_df.sample(5)

Unnamed: 0,year,type,discipline,event,as,athlete_id,noc,team,place,tied,medal
123405,1972.0,Summer,Wrestling,"Middleweight, Freestyle, Men (Olympic)",Jesús Blanco,57099,ARG,,,False,
106001,2000.0,Summer,Swimming (Aquatics),"100 metres Freestyle, Men (Olympic)",Yoav Bruck,48358,ISR,,38.0,False,
47179,1960.0,Summer,Football (Football),"Football, Men (Olympic)",Flemming Nielsen,24669,DEN,Denmark,2.0,False,Silver
296581,2020.0,Winter,Nordic Combined (Skiing),Ski Jumping (Skiing),Niklas Malacinski,139843,USA,USA,,True,
21606,1900.0,Summer,Equestrian Driving (Equestrian),"Four-In-Hand Competition, Open (Olympic)",Georges Nagelmackers,12478,BEL,,1.0,False,Gold


In [4]:
bios_df = pd.read_csv('./data/bios.csv')
bios_df.sample(5)

Unnamed: 0,athlete_id,name,born_date,born_city,born_region,born_country,NOC,height_cm,weight_kg,died_date
125080,127423,Stanislav Benyov,1991-04-19,Sofia,Sofia Grad,BUL,Bulgaria,189.0,90.0,
11407,11467,Barbara Stracey,1953-11-06,Montréal,Québec,CAN,Canada,155.0,50.0,
12889,12960,Carl von Moers,1871-12-09,Neuwied,Rheinland-Pfalz,GER,Germany,,,1957-05-26
97006,97786,Ron Steele,1953-08-19,Leavenworth,Washington,USA,United States,183.0,80.0,
37281,37577,Steffen Bogs,1965-10-08,Rostock,Mecklenburg-Vorpommern,GER,East Germany,189.0,89.0,


In [27]:
ph = bios_df[bios_df['born_country'] == 'PHI']
ph.sample(10)

Unnamed: 0,athlete_id,name,born_date,born_city,born_region,born_country,NOC,height_cm,weight_kg,died_date
7625,7666,José Padilla,1911-07-16,Plaridel,Bulacan,PHI,Philippines,,,1978-06-18
114936,116263,Hidilyn Diaz,1991-02-20,Zamboanga,Zamboanga del Sur,PHI,Philippines,149.0,53.0,
6293,6322,Robert Jaworski,1946-03-08,Baguio City,Benguet,PHI,Philippines,183.0,84.0,
124563,126870,Mark Barriga,1993-06-11,Danao,Cebu,PHI,Philippines,157.0,48.0,
7636,7677,Mariano Vélez,1925-01-01,Cagayan de Oro,Misamis Oriental,PHI,Philippines,,,2004-01-01
93397,94149,Donald Geisler,1978-10-06,Angeles City,Pampanga,PHI,Philippines,185.0,80.0,
6781,6814,Jun Papa,1945-07-14,Manila,Manila,PHI,Philippines,180.0,75.0,2005-10-27
49366,49723,Pedro Cayco,1932-05-10,Malabon,Malabon,PHI,Philippines,,,
6764,6797,Eddie Pacheco,1936-01-04,Manila,Manila,PHI,Philippines,175.0,79.0,2009-12-09
117626,119135,William Beier,1982-11-02,Manila,Manila,PHI,Germany,172.0,64.0,


In [29]:
hid = results_df[results_df['athlete_id']==116263]
hid

Unnamed: 0,year,type,discipline,event,as,athlete_id,noc,team,place,tied,medal
250024,2008.0,Summer,Weightlifting,"Lightweight, Women (Olympic)",Hidilyn Diaz,116263,PHI,,10.0,False,
250025,2012.0,Summer,Weightlifting,"Lightweight, Women (Olympic)",Hidilyn Diaz,116263,PHI,,,False,
250026,2016.0,Summer,Weightlifting,"Featherweight, Women (Olympic)",Hidilyn Diaz,116263,PHI,,2.0,False,Silver
250027,2020.0,Summer,Weightlifting,"Featherweight, Women (Olympic)",Hidilyn Diaz,116263,PHI,,1.0,False,Gold
