# Preparing the Game Summary Dataset

In [1]:
#Load packages, set pandas display options, and change the working directory
#Warnings are filtered because some of the cleaning steps raise:
# SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
# Since this didn't affect the end results, I opted to ignore the warning
import warnings
warnings.simplefilter(action='ignore')
import os
import json
import pandas as pd
import re
import numpy as np
pd.options.display.max_colwidth = 125
pd.options.display.max_rows = 100
os.chdir('C:\\Users\\mhous\\scrap\\JeopardyProject')

In [2]:
#Load the scraped data
with open('games.json') as json_file: 
    data = json.load(json_file) 
#Load the data into a pandas dataframe
df = pd.DataFrame(data)
df.head()

,game_id,game_comments,contestant_1,contestant_1_bio,contestant_2,contestant_2_bio,returning_champion,returning_champion_bio,rc_score_J,c_2_score_J,c_1_score_J,rc_score_DJ,c_2_score_DJ,c_1_score_DJ,rc_score_F,c_2_score_F,c_1_score_F
0,"Show #3966 - Monday, November 26, 2001",[Clue dollar values are doubled.],Harold Skinner,", a teacher and playwright from Columbia, South Carolina",Geoffrey Zimmerman,", a lawyer from Toronto, Canada",Kristin Lawhead,", a multimedia artist from New Orleans, Louisiana","$6,600","$5,000","$4,000","$3,400","$3,800","$10,000",$0,"$7,600","$7,700"
1,"Show #7943 - Wednesday, March 6, 2019",[],Tim Varecka,", an engineer from Tucson, Arizona",Eric Eifrig,", a lawyer from Cincinnati, Ohio",Dana Wayne,", an educator from North Hollywood, California (whose 1-day cash winnings total $26,401)","$2,200","$3,200","$4,000","$10,600","$11,400","$14,400","$6,000","$1,599","$5,999"
2,"Show #4089 - Thursday, May 16, 2002","[Ben Tritle game 1., \r\nFirst game in which runners-up are awarded cash prizes ($2,000 for 2nd place and $1,000 for 3rd ...",Allison Owens,", a teacher from Houston, Texas",Ben Tritle,", an apartment manager from Los Angeles, California",Ronnie O'Rourke,", a homemaker from Marietta, Georgia (whose 1-day cash winnings total $2,000)","$8,500",$600,"$4,000","$19,900","$4,600","$16,800","$6,200","$6,600","$1,800"
3,"Show #4281 - Monday, March 24, 2003",[],Shawn Wilson,", a technical writer from Chatsworth, California",Donna Corbett,", an office manager from Plymouth, Massachusetts",Sara Glidden,", a college theater manager from Roxbury, Massachusetts (whose 2-day cash winnings total $27,950)","$3,000","$1,400","$4,800","$13,600","$4,200","$18,400","$8,600","$1,000","$9,599"
4,"Show #6089 - Thursday, February 17, 2011",[2011 Teen Tournament quarterfinal game 1.],Brandon Welch,", a senior from Grayson, Georgia",Kate Wadman,", a junior from Tucson, Arizona",Christian Ie,", a senior from Renton, Washington",$0,"$7,400","$4,000","$4,400","$24,800","$17,600",$1,"$14,400","$35,200"


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4282 entries, 0 to 4281
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   game_id                 4282 non-null   object
 1   game_comments           4282 non-null   object
 2   contestant_1            4282 non-null   object
 3   contestant_1_bio        4282 non-null   object
 4   contestant_2            4282 non-null   object
 5   contestant_2_bio        4282 non-null   object
 6   returning_champion      4282 non-null   object
 7   returning_champion_bio  4282 non-null   object
 8   rc_score_J              4282 non-null   object
 9   c_2_score_J             4282 non-null   object
 10  c_1_score_J             4282 non-null   object
 11  rc_score_DJ             4282 non-null   object
 12  c_2_score_DJ            4282 non-null   object
 13  c_1_score_DJ            4282 non-null   object
 14  rc_score_F              4282 non-null   object
 15  c_2_

Tasks for data preparation: 
1. Change index to the show number and reorder
2. Convert the 'game_comments' column to a string, rather than a list of strings
3. Fix each contestant's bio:
    1. Remove the ', ' from the start of every entry
    2. Make separate columns for occupation and location
    3. In addition to occupation and location for returning champion, also create a column for winning streak and total winnings
4. Remove the '$' and ',' from each round's final score for each player so it can be converted to float (not int, since some people wager cents for whatever reason)



1. Fixing the index

In [4]:
#Create column 'show_number': split the string into a list of strings, select the second element of the list, and remove the '#'
df['show_number'] = df['game_id'].str.split(' ').str[1].str.strip('#')
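The split above leaves 'show_number' as a string. That sorts correctly here because the show numbers in this dataset all appear to have four digits, but it would not in general. As a minimal sketch on toy data (not the method used in this notebook), a regular expression can pull out the digits and cast them to integers so the sort is numeric:

```python
import pandas as pd

# Two game_id strings in the same format as the scraped data.
game_ids = pd.Series([
    "Show #7943 - Wednesday, March 6, 2019",
    "Show #3966 - Monday, November 26, 2001",
])

# Capture the digits that follow '#' and cast them to integers, so that
# sorting is numeric rather than character-by-character.
show_numbers = game_ids.str.extract(r"#(\d+)", expand=False).astype(int)
print(show_numbers.sort_values().tolist())
```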

In [5]:
#Set 'show_number' as the index, then sort in ascending order
df = df.set_index(['show_number']).sort_index()
df.head()

show_number,game_id,game_comments,contestant_1,contestant_1_bio,contestant_2,contestant_2_bio,returning_champion,returning_champion_bio,rc_score_J,c_2_score_J,c_1_score_J,rc_score_DJ,c_2_score_DJ,c_1_score_DJ,rc_score_F,c_2_score_F,c_1_score_F
3966,"Show #3966 - Monday, November 26, 2001",[Clue dollar values are doubled.],Harold Skinner,", a teacher and playwright from Columbia, South Carolina",Geoffrey Zimmerman,", a lawyer from Toronto, Canada",Kristin Lawhead,", a multimedia artist from New Orleans, Louisiana","$6,600","$5,000","$4,000","$3,400","$3,800","$10,000",$0,"$7,600","$7,700"
3967,"Show #3967 - Tuesday, November 27, 2001",[],Kris MacCubbin,", a director of communication from Kensington, Maryland",Rebekah Lacey,", an environmental scientist from Boston, Massachusetts",Harold Skinner,", a teacher and playwright from Columbia, South Carolina (whose 1-day cash winnings total $7,700)","$5,000","$7,000","$3,400","$13,400","$15,800","$7,800",$0,"$20,800","$12,200"
3968,"Show #3968 - Wednesday, November 28, 2001","[(Cheryl: This is Cheryl from the , Jeopardy!, Clue Crew, and we're at , Sea World, , where the entertainers get paid in...",Trish Miller,", an administrative assistant from Boston, Massachusetts",Susan Rathke,", a residential caregiver from Madison, Wisconsin",Rebekah Lacey,", an environmental scientist from Boston, Massachusetts (whose 1-day cash winnings total $20,800)","$1,200","$7,000","$4,000","$11,200","$15,400","$8,000","$6,200","$23,400","$15,000"
3969,"Show #3969 - Thursday, November 29, 2001","[(Sarah: I'm Sarah of the Clue Crew. Today on , Jeopardy!, , , everything's fair game, . Stay tuned.)]",Adam Lipsius,", a video professor and filmmaker from New York, New York",Frank Stasio,", a data analyst from New York, New York",Susan Rathke,", a residential caregiver from Madison, Wisconsin (whose 1-day cash winnings total $23,400)","$2,200","$5,600","$4,600","$17,000","$15,000","$1,400","$13,000","$30,000","$2,300"
3970,"Show #3970 - Friday, November 30, 2001","[(Sofia: Hi, this is Sofia of the Clue Crew. Today, we give you a little , taste of New York City, .)]",Charan Brahma,", an attorney from Los Angeles, California",Lara Kierlin,", a pre-med student from Hermosa Beach, California",Frank Stasio,", a data analyst from New York, New York (whose 1-day cash winnings total $30,000)",-$800,"$5,200","$3,400","$12,000","$7,600","$12,000","$12,000","$3,199",$500


2. Fixing game comments

In [6]:
#Joins each element of the list into one string, then replaces '\r\n' with a space
df['game_comments'] = [''.join(map(str, parts)) for parts in df['game_comments']]
df['game_comments'] = df['game_comments'].str.replace('\r\n', ' ', regex=False)
df.head()

show_number,game_id,game_comments,contestant_1,contestant_1_bio,contestant_2,contestant_2_bio,returning_champion,returning_champion_bio,rc_score_J,c_2_score_J,c_1_score_J,rc_score_DJ,c_2_score_DJ,c_1_score_DJ,rc_score_F,c_2_score_F,c_1_score_F
3966,"Show #3966 - Monday, November 26, 2001",Clue dollar values are doubled.,Harold Skinner,", a teacher and playwright from Columbia, South Carolina",Geoffrey Zimmerman,", a lawyer from Toronto, Canada",Kristin Lawhead,", a multimedia artist from New Orleans, Louisiana","$6,600","$5,000","$4,000","$3,400","$3,800","$10,000",$0,"$7,600","$7,700"
3967,"Show #3967 - Tuesday, November 27, 2001",,Kris MacCubbin,", a director of communication from Kensington, Maryland",Rebekah Lacey,", an environmental scientist from Boston, Massachusetts",Harold Skinner,", a teacher and playwright from Columbia, South Carolina (whose 1-day cash winnings total $7,700)","$5,000","$7,000","$3,400","$13,400","$15,800","$7,800",$0,"$20,800","$12,200"
3968,"Show #3968 - Wednesday, November 28, 2001","(Cheryl: This is Cheryl from the Jeopardy! Clue Crew, and we're at Sea World, where the entertainers get paid in fish. A...",Trish Miller,", an administrative assistant from Boston, Massachusetts",Susan Rathke,", a residential caregiver from Madison, Wisconsin",Rebekah Lacey,", an environmental scientist from Boston, Massachusetts (whose 1-day cash winnings total $20,800)","$1,200","$7,000","$4,000","$11,200","$15,400","$8,000","$6,200","$23,400","$15,000"
3969,"Show #3969 - Thursday, November 29, 2001","(Sarah: I'm Sarah of the Clue Crew. Today on Jeopardy!, everything's fair game. Stay tuned.)",Adam Lipsius,", a video professor and filmmaker from New York, New York",Frank Stasio,", a data analyst from New York, New York",Susan Rathke,", a residential caregiver from Madison, Wisconsin (whose 1-day cash winnings total $23,400)","$2,200","$5,600","$4,600","$17,000","$15,000","$1,400","$13,000","$30,000","$2,300"
3970,"Show #3970 - Friday, November 30, 2001","(Sofia: Hi, this is Sofia of the Clue Crew. Today, we give you a little taste of New York City.)",Charan Brahma,", an attorney from Los Angeles, California",Lara Kierlin,", a pre-med student from Hermosa Beach, California",Frank Stasio,", a data analyst from New York, New York (whose 1-day cash winnings total $30,000)",-$800,"$5,200","$3,400","$12,000","$7,600","$12,000","$12,000","$3,199",$500
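To make the comment cleanup concrete, here is a small sketch on toy data (the third entry is a shortened, illustrative version of a real comment). It uses `Series.str.join` instead of the list comprehension above, which should give the same result when every list element is already a string:

```python
import pandas as pd

# Each entry is a list of comment fragments, as in the scraped JSON.
comments = pd.Series([
    ["Clue dollar values are doubled."],
    [],
    ["Ben Tritle game 1.", "\r\nFirst game with cash prizes for runners-up."],
])

# Join the fragments of each list into one string, then replace the
# Windows-style line break with a single space.
joined = comments.str.join("").str.replace("\r\n", " ", regex=False)
print(joined.tolist())
```

An empty list becomes an empty string, which is why games without comments show a blank in the table above.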


3. Cleaning Bios

In [7]:
# For each bio, strip the leading ', ', then split the string on 'from' into a list
# whose first element is the occupation and whose second is the location.
# .apply(pd.Series) converts the lists into columns, and concat adds those columns to the dataframe

df = pd.concat([df, df['contestant_1_bio'].str.strip(', ').str.split('from').apply(pd.Series)], axis=1)
df = pd.concat([df, df['contestant_2_bio'].str.strip(', ').str.split('from').apply(pd.Series)], axis=1)
df = pd.concat([df, df['returning_champion_bio'].str.strip(', ').str.split('from').apply(pd.Series)], axis=1)
df.head(3)

show_number,game_id,game_comments,contestant_1,contestant_1_bio,contestant_2,contestant_2_bio,returning_champion,returning_champion_bio,rc_score_J,c_2_score_J,...,c_1_score_F,0,1,2,0,1,2,0,1,2
3966,"Show #3966 - Monday, November 26, 2001",Clue dollar values are doubled.,Harold Skinner,", a teacher and playwright from Columbia, South Carolina",Geoffrey Zimmerman,", a lawyer from Toronto, Canada",Kristin Lawhead,", a multimedia artist from New Orleans, Louisiana","$6,600","$5,000",...,"$7,700",a teacher and playwright,"Columbia, South Carolina",,a lawyer,"Toronto, Canada",,a multimedia artist,"New Orleans, Louisiana",
3967,"Show #3967 - Tuesday, November 27, 2001",,Kris MacCubbin,", a director of communication from Kensington, Maryland",Rebekah Lacey,", an environmental scientist from Boston, Massachusetts",Harold Skinner,", a teacher and playwright from Columbia, South Carolina (whose 1-day cash winnings total $7,700)","$5,000","$7,000",...,"$12,200",a director of communication,"Kensington, Maryland",,an environmental scientist,"Boston, Massachusetts",,a teacher and playwright,"Columbia, South Carolina (whose 1-day cash winnings total $7,700)",
3968,"Show #3968 - Wednesday, November 28, 2001","(Cheryl: This is Cheryl from the Jeopardy! Clue Crew, and we're at Sea World, where the entertainers get paid in fish. A...",Trish Miller,", an administrative assistant from Boston, Massachusetts",Susan Rathke,", a residential caregiver from Madison, Wisconsin",Rebekah Lacey,", an environmental scientist from Boston, Massachusetts (whose 1-day cash winnings total $20,800)","$1,200","$7,000",...,"$15,000",an administrative assistant,"Boston, Massachusetts",,a residential caregiver,"Madison, Wisconsin",,an environmental scientist,"Boston, Massachusetts (whose 1-day cash winnings total $20,800)",


In [8]:
#Drop the old bios since they are now redundant
#Bios for college tournaments use a different syntax from regular games. In general, a contestant in a college tournament is
#described as 'a [freshman/sophomore/...] at [University] from [home location]'
#As a result, the split above creates three columns: occupation, location (or university), and a third that is NaN for most rows.
#Since college tournaments are a small share of the games, most of the third column is NaN, so I'm dropping it and ignoring the issue for now

df = df.drop(['contestant_1_bio', 'contestant_2_bio', 'returning_champion_bio', 2], axis=1)

In [9]:
#Rename the columns to reflect the contestant's bio
df.columns = ['game_id','game_comments','contestant_1','contestant_2', 'returning_champion','rc_score_J','c_2_score_J','c_1_score_J','rc_score_DJ','c_2_score_DJ','c_1_score_DJ','rc_score_F','c_2_score_F','c_1_score_F','contestant_1_job','contestant_1_location','contestant_2_job','contestant_2_location','returning_champion_job','returning_champion_location']

In [10]:
df.head()

show_number,game_id,game_comments,contestant_1,contestant_2,returning_champion,rc_score_J,c_2_score_J,c_1_score_J,rc_score_DJ,c_2_score_DJ,c_1_score_DJ,rc_score_F,c_2_score_F,c_1_score_F,contestant_1_job,contestant_1_location,contestant_2_job,contestant_2_location,returning_champion_job,returning_champion_location
3966,"Show #3966 - Monday, November 26, 2001",Clue dollar values are doubled.,Harold Skinner,Geoffrey Zimmerman,Kristin Lawhead,"$6,600","$5,000","$4,000","$3,400","$3,800","$10,000",$0,"$7,600","$7,700",a teacher and playwright,"Columbia, South Carolina",a lawyer,"Toronto, Canada",a multimedia artist,"New Orleans, Louisiana"
3967,"Show #3967 - Tuesday, November 27, 2001",,Kris MacCubbin,Rebekah Lacey,Harold Skinner,"$5,000","$7,000","$3,400","$13,400","$15,800","$7,800",$0,"$20,800","$12,200",a director of communication,"Kensington, Maryland",an environmental scientist,"Boston, Massachusetts",a teacher and playwright,"Columbia, South Carolina (whose 1-day cash winnings total $7,700)"
3968,"Show #3968 - Wednesday, November 28, 2001","(Cheryl: This is Cheryl from the Jeopardy! Clue Crew, and we're at Sea World, where the entertainers get paid in fish. A...",Trish Miller,Susan Rathke,Rebekah Lacey,"$1,200","$7,000","$4,000","$11,200","$15,400","$8,000","$6,200","$23,400","$15,000",an administrative assistant,"Boston, Massachusetts",a residential caregiver,"Madison, Wisconsin",an environmental scientist,"Boston, Massachusetts (whose 1-day cash winnings total $20,800)"
3969,"Show #3969 - Thursday, November 29, 2001","(Sarah: I'm Sarah of the Clue Crew. Today on Jeopardy!, everything's fair game. Stay tuned.)",Adam Lipsius,Frank Stasio,Susan Rathke,"$2,200","$5,600","$4,600","$17,000","$15,000","$1,400","$13,000","$30,000","$2,300",a video professor and filmmaker,"New York, New York",a data analyst,"New York, New York",a residential caregiver,"Madison, Wisconsin (whose 1-day cash winnings total $23,400)"
3970,"Show #3970 - Friday, November 30, 2001","(Sofia: Hi, this is Sofia of the Clue Crew. Today, we give you a little taste of New York City.)",Charan Brahma,Lara Kierlin,Frank Stasio,-$800,"$5,200","$3,400","$12,000","$7,600","$12,000","$12,000","$3,199",$500,an attorney,"Los Angeles, California",a pre-med student,"Hermosa Beach, California",a data analyst,"New York, New York (whose 1-day cash winnings total $30,000)"


In [11]:
#Since the returning champion's bio follows the syntax "[occupation] from [location] (whose n-day cash winnings total $X)", the column for
#returning champion's location also includes the winning streak and cash winnings.
#Splitting the string on the first '(' gives a list of two strings,
#the first being the location and the second the win streak and cash total.
#As above, this code converts each list into columns and concats them to the dataframe
#Since two new columns are made, we drop the original column as redundant

#Also, since some games are tournaments, the 'n-day cash winnings total $' won't be present, so those games will have NaNs
df = pd.concat([df, df['returning_champion_location'].str.split('(').apply(pd.Series)], axis=1)
df = df.drop(['returning_champion_location'], axis=1)


In [12]:
#Relabel new columns
df.columns = ['game_id','game_comments','contestant_1','contestant_2', 'returning_champion','rc_score_J','c_2_score_J','c_1_score_J','rc_score_DJ','c_2_score_DJ','c_1_score_DJ','rc_score_F','c_2_score_F','c_1_score_F','contestant_1_job','contestant_1_location','contestant_2_job','contestant_2_location','returning_champion_job','returning_champion_location', 'returning_champion_streak']

In [13]:
#First line strips the closing ), then selects the sixth element, the cash winnings 
#Second line does the same but selects the win streak and removes the '-day'
df['returning_champion_winnings'] = df['returning_champion_streak'].str.strip(')').str.split().str[5]
df['returning_champion_streak'] = df['returning_champion_streak'].str.strip(')').str.split().str[1].str.split('-').str[0]
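The bio cleanup above is spread over several split and strip steps. As an alternative sketch only (not what this notebook does), a single regular expression with named groups can pull all four pieces out of one bio string. The pattern and the `parse_bio` helper are hypothetical, and they assume the regular-game syntax shown in the tables above; tournament bios would need their own handling.

```python
import re

# Hypothetical pattern: ", <job> from <location>" with an optional
# "(whose N-day cash winnings total $X)" suffix for returning champions.
BIO_PATTERN = re.compile(
    r"^,\s*(?P<job>.+?)\s+from\s+(?P<location>[^(]+?)"
    r"(?:\s*\(whose (?P<streak>\d+)-day cash winnings total "
    r"(?P<winnings>\$[\d,]+)\))?\s*$"
)

def parse_bio(bio):
    """Return a dict of bio fields, or None if the bio has another syntax."""
    match = BIO_PATTERN.match(bio)
    return match.groupdict() if match else None

champion = parse_bio(
    ", a teacher and playwright from Columbia, South Carolina "
    "(whose 1-day cash winnings total $7,700)"
)
challenger = parse_bio(", a lawyer from Toronto, Canada")
print(champion)
print(challenger)
```

For a challenger, the 'streak' and 'winnings' groups come back as None, which plays the same role as the NaNs produced by the split-based code.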

In [14]:
#Since all the locations now share the same 'City, State' format, we can split them into two columns for city and state
df['contestant_1_state'] = df['contestant_1_location'].str.split(',').str[1]
df['contestant_1_city'] = df['contestant_1_location'].str.split(',').str[0]
df['contestant_2_state'] = df['contestant_2_location'].str.split(',').str[1]
df['contestant_2_city'] = df['contestant_2_location'].str.split(',').str[0]
df['returning_champion_state'] = df['returning_champion_location'].str.split(',').str[1]
df['returning_champion_city'] = df['returning_champion_location'].str.split(',').str[0]
#drop() returns a new dataframe rather than modifying df, so this line only displays the result;
#the *_location columns are actually removed from df by the reindex in the next cell
df.drop(['returning_champion_location'], axis=1)

show_number,game_id,game_comments,contestant_1,contestant_2,returning_champion,rc_score_J,c_2_score_J,c_1_score_J,rc_score_DJ,c_2_score_DJ,...,contestant_2_location,returning_champion_job,returning_champion_streak,returning_champion_winnings,contestant_1_state,contestant_1_city,contestant_2_state,contestant_2_city,returning_champion_state,returning_champion_city
3966,"Show #3966 - Monday, November 26, 2001",Clue dollar values are doubled.,Harold Skinner,Geoffrey Zimmerman,Kristin Lawhead,"$6,600","$5,000","$4,000","$3,400","$3,800",...,"Toronto, Canada",a multimedia artist,,,South Carolina,Columbia,Canada,Toronto,Louisiana,New Orleans
3967,"Show #3967 - Tuesday, November 27, 2001",,Kris MacCubbin,Rebekah Lacey,Harold Skinner,"$5,000","$7,000","$3,400","$13,400","$15,800",...,"Boston, Massachusetts",a teacher and playwright,1,"$7,700",Maryland,Kensington,Massachusetts,Boston,South Carolina,Columbia
3968,"Show #3968 - Wednesday, November 28, 2001","(Cheryl: This is Cheryl from the Jeopardy! Clue Crew, and we're at Sea World, where the entertainers get paid in fish. A...",Trish Miller,Susan Rathke,Rebekah Lacey,"$1,200","$7,000","$4,000","$11,200","$15,400",...,"Madison, Wisconsin",an environmental scientist,1,"$20,800",Massachusetts,Boston,Wisconsin,Madison,Massachusetts,Boston
3969,"Show #3969 - Thursday, November 29, 2001","(Sarah: I'm Sarah of the Clue Crew. Today on Jeopardy!, everything's fair game. Stay tuned.)",Adam Lipsius,Frank Stasio,Susan Rathke,"$2,200","$5,600","$4,600","$17,000","$15,000",...,"New York, New York",a residential caregiver,1,"$23,400",New York,New York,New York,New York,Wisconsin,Madison
3970,"Show #3970 - Friday, November 30, 2001","(Sofia: Hi, this is Sofia of the Clue Crew. Today, we give you a little taste of New York City.)",Charan Brahma,Lara Kierlin,Frank Stasio,-$800,"$5,200","$3,400","$12,000","$7,600",...,"Hermosa Beach, California",a data analyst,1,"$30,000",California,Los Angeles,California,Hermosa Beach,New York,New York
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8259,"Show #8259 - Thursday, October 15, 2020",Kevin Walsh game 6.,Kristin Hucek,Daniel Lee,Kevin Walsh,"$4,400","$2,400","$4,200","$2,000","$12,000",...,"South Pasadena, California",a story analyst originally,5,"$111,301",California,San Francisco,California,South Pasadena,New Jersey,Williamstown
8260,"Show #8260 - Friday, October 16, 2020",,Aanchal Ramani,Aaron Ballett,Kristin Hucek,"$5,200","$5,300","$1,800","$8,000","$5,300",...,"Santa Barbara, California",an attorney,1,"$2,700",California,San Francisco,California,Santa Barbara,California,San Francisco
8261,"Show #8261 - Monday, October 19, 2020",,Joe Aquino,Nancy Bosecker,Kristin Hucek,"$3,600","$1,600","$3,600","$8,800","$8,000",...,"Peoria, Illinois",an attorney,2,"$8,000",California,Chula Vista,Illinois,Peoria,California,San Francisco
8262,"Show #8262 - Tuesday, October 20, 2020",,Maddie Kahan,Carlos Chaidez,Kristin Hucek,$800,$600,"$2,000","-$3,600","$20,400",...,"Burbank, California",an attorney,3,"$24,808",California,Agoura Hills,California,Burbank,California,San Francisco
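One detail of the city/state split is worth spelling out: a location without a comma has no second element, so its state comes out missing. The sketch below uses toy locations (the third one is made up for illustration) and a slightly different call, `str.split(..., n=1, expand=True)`, plus `.str.strip()` to remove whitespace left over from the earlier splits. It is a variation, not the code used above.

```python
import pandas as pd

# Toy locations; the leading space mimics what is left after splitting on 'from'.
locations = pd.Series([" Columbia, South Carolina", " Toronto, Canada", " Washington"])

# Split on the first comma only, expanding the result into two columns.
parts = locations.str.split(",", n=1, expand=True)
city = parts[0].str.strip()
state = parts[1].str.strip()

print(city.tolist())
print(state.isna().tolist())
```

The last entry has a city but a missing state, which is one way null values can end up in the state columns.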


In [15]:
#Reorder the columns so they make some sense
cols = ['game_id','game_comments','contestant_1', 'contestant_1_job', 'contestant_1_city', 'contestant_1_state', 'contestant_2', 'contestant_2_job', 'contestant_2_city', 'contestant_2_state', 'returning_champion','returning_champion_job', 'returning_champion_city', 'returning_champion_state', 'returning_champion_streak', 'returning_champion_winnings', 'c_1_score_J','c_2_score_J','rc_score_J', 'c_1_score_DJ', 'c_2_score_DJ', 'rc_score_DJ', 'c_1_score_F', 'c_2_score_F', 'rc_score_F']
df = df.reindex(columns=cols)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4282 entries, 3966 to 8263
Data columns (total 25 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   game_id                      4282 non-null   object
 1   game_comments                4282 non-null   object
 2   contestant_1                 4282 non-null   object
 3   contestant_1_job             4282 non-null   object
 4   contestant_1_city            4279 non-null   object
 5   contestant_1_state           4176 non-null   object
 6   contestant_2                 4282 non-null   object
 7   contestant_2_job             4282 non-null   object
 8   contestant_2_city            4282 non-null   object
 9   contestant_2_state           4175 non-null   object
 10  returning_champion           4282 non-null   object
 11  returning_champion_job       4282 non-null   object
 12  returning_champion_city      4281 non-null   object
 13  returning_champion_state     4161 n

The null values in some of the columns reflect changes in biography syntax due to tournaments.

4. Converting dollar values to floats

In [16]:
#Strip '$' and ',' from the returning champion's total cash winnings (column 15) and from each contestant's score after each round (columns 16-24), then convert to float
df[df.columns[15:25]] = df[df.columns[15:25]].replace(r'[\$,]', '', regex=True).astype(float)
df.head()

show_number,game_id,game_comments,contestant_1,contestant_1_job,contestant_1_city,contestant_1_state,contestant_2,contestant_2_job,contestant_2_city,contestant_2_state,...,returning_champion_winnings,c_1_score_J,c_2_score_J,rc_score_J,c_1_score_DJ,c_2_score_DJ,rc_score_DJ,c_1_score_F,c_2_score_F,rc_score_F
3966,"Show #3966 - Monday, November 26, 2001",Clue dollar values are doubled.,Harold Skinner,a teacher and playwright,Columbia,South Carolina,Geoffrey Zimmerman,a lawyer,Toronto,Canada,...,,4000.0,5000.0,6600.0,10000.0,3800.0,3400.0,7700.0,7600.0,0.0
3967,"Show #3967 - Tuesday, November 27, 2001",,Kris MacCubbin,a director of communication,Kensington,Maryland,Rebekah Lacey,an environmental scientist,Boston,Massachusetts,...,7700.0,3400.0,7000.0,5000.0,7800.0,15800.0,13400.0,12200.0,20800.0,0.0
3968,"Show #3968 - Wednesday, November 28, 2001","(Cheryl: This is Cheryl from the Jeopardy! Clue Crew, and we're at Sea World, where the entertainers get paid in fish. A...",Trish Miller,an administrative assistant,Boston,Massachusetts,Susan Rathke,a residential caregiver,Madison,Wisconsin,...,20800.0,4000.0,7000.0,1200.0,8000.0,15400.0,11200.0,15000.0,23400.0,6200.0
3969,"Show #3969 - Thursday, November 29, 2001","(Sarah: I'm Sarah of the Clue Crew. Today on Jeopardy!, everything's fair game. Stay tuned.)",Adam Lipsius,a video professor and filmmaker,New York,New York,Frank Stasio,a data analyst,New York,New York,...,23400.0,4600.0,5600.0,2200.0,1400.0,15000.0,17000.0,2300.0,30000.0,13000.0
3970,"Show #3970 - Friday, November 30, 2001","(Sofia: Hi, this is Sofia of the Clue Crew. Today, we give you a little taste of New York City.)",Charan Brahma,an attorney,Los Angeles,California,Lara Kierlin,a pre-med student,Hermosa Beach,California,...,30000.0,3400.0,5200.0,-800.0,12000.0,7600.0,12000.0,500.0,3199.0,12000.0
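To see what the regex replacement does to individual values, here is a small sketch on toy score strings (the value with cents is made up for illustration). Only the '$' and ',' characters are removed, so a leading minus sign survives and negative scores convert correctly:

```python
import pandas as pd

# Toy score strings in the same format as the scraped data.
scores = pd.DataFrame({
    "c_1_score_F": ["$7,700", "$1,599.50"],
    "rc_score_F": ["-$800", "$0"],
})

# Remove every '$' and ',' and then convert the remaining text to float.
cleaned = scores.replace(r"[\$,]", "", regex=True).astype(float)
print(cleaned)
```

Using float rather than int is what lets a value with cents pass through without an error.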


In [17]:
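#Create columns for each contestant's Final Jeopardy wager (the absolute change between the Double Jeopardy score and the final score) and for the winning score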
df['c_1_wager'] = abs(df['c_1_score_DJ'] - df['c_1_score_F'])
df['c_2_wager'] = abs(df['c_2_score_DJ'] - df['c_2_score_F'])
df['rc_wager'] = abs(df['rc_score_DJ'] - df['rc_score_F'])
df['winning_score'] = df[['c_1_score_F', 'c_2_score_F', 'rc_score_F']].max(axis=1)

In [18]:
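#Each condition is True when one contestant's final score is strictly higher than both of the others
#np.select assigns the matching label; rows where no condition holds (ties for first place) get the default value, 0
conditions = [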
    ((df['c_1_score_F'] > df['c_2_score_F']) & (df['c_1_score_F'] > df['rc_score_F'])),
    ((df['c_2_score_F'] > df['c_1_score_F']) & (df['c_2_score_F'] > df['rc_score_F'])),
    ((df['rc_score_F'] > df['c_2_score_F']) & (df['rc_score_F'] > df['c_1_score_F'])),
]
values = ['contestant_1', 'contestant_2', 'returning_champion']
df['winning_contestant'] = np.select(conditions, values)
df['winning_contestant'].value_counts()

returning_champion    1935
contestant_2          1165
contestant_1          1153
0                       29
Name: winning_contestant, dtype: int64
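The 29 rows labeled 0 are games where no contestant's final score is strictly higher than both of the others, so `np.select` falls back to its default value. A minimal sketch on two toy rows (final scores taken from shows 3966 and 4150 in the tables in this notebook) shows that passing `default` explicitly labels those rows directly, instead of relying on the integer 0 being turned into the string "0":

```python
import numpy as np
import pandas as pd

# Final scores for one game with a clear winner and one with a tie for first.
toy = pd.DataFrame({
    "c_1_score_F": [7700.0, 18000.0],
    "c_2_score_F": [7600.0, 18000.0],
    "rc_score_F": [0.0, 10801.0],
})

conditions = [
    (toy["c_1_score_F"] > toy["c_2_score_F"]) & (toy["c_1_score_F"] > toy["rc_score_F"]),
    (toy["c_2_score_F"] > toy["c_1_score_F"]) & (toy["c_2_score_F"] > toy["rc_score_F"]),
    (toy["rc_score_F"] > toy["c_1_score_F"]) & (toy["rc_score_F"] > toy["c_2_score_F"]),
]
labels = ["contestant_1", "contestant_2", "returning_champion"]

# Rows where none of the strict comparisons hold get the explicit default.
winner = np.select(conditions, labels, default="Tied")
print(winner.tolist())
```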

In [19]:
#Rows with no strict winner were labeled "0" by np.select; relabel them as 'Tied'
df.loc[df['winning_contestant'] == "0", 'winning_contestant'] = 'Tied'

In [20]:
df['winning_contestant'].value_counts()

returning_champion    1935
contestant_2          1165
contestant_1          1153
Tied                    29
Name: winning_contestant, dtype: int64

In [21]:
#Replace the column label stored in 'winning_contestant' (e.g. 'returning_champion') with the winning contestant's name
for col in ['contestant_1', 'contestant_2', 'returning_champion']:
    mask = df['winning_contestant'] == col
    df.loc[mask, 'winning_contestant'] = df.loc[mask, col]

In [22]:
df[df['winning_contestant'] == "Tied"].head()

show_number,game_id,game_comments,contestant_1,contestant_1_job,contestant_1_city,contestant_1_state,contestant_2,contestant_2_job,contestant_2_city,contestant_2_state,...,c_2_score_DJ,rc_score_DJ,c_1_score_F,c_2_score_F,rc_score_F,c_1_wager,c_2_wager,rc_wager,winning_score,winning_contestant
4150,"Show #4150 - Friday, September 20, 2002",2002 Back to School Week game 5.,Mike Scott,an eleven-year-old,Lake Villa,Illinois,David McIntyre,a twelve-year-old,Riverside,California,...,9000.0,9000.0,18000.0,18000.0,10801.0,7200.0,9000.0,1801.0,18000.0,Tied
4273,"Show #4273 - Wednesday, March 12, 2003",,Karen Turay,an environmental scientist,Arlington,Virginia,David Dayen,a video editor originally,Philadelphia,Pennsylvania,...,5200.0,18800.0,0.0,10400.0,10400.0,5200.0,5200.0,8400.0,10400.0,Tied
4457,"Show #4457 - Tuesday, January 13, 2004",Tom Walsh game 7. First 7-day champion.,Meg Wall-Wild,a publications editor,Madison,Wisconsin,Dave Fuller,a high school teacher,Midlothian,Virginia,...,16000.0,17300.0,24200.0,32000.0,32000.0,12000.0,16000.0,14700.0,32000.0,Tied
4487,"Show #4487 - Tuesday, February 24, 2004",Arthur Gandolfi game 3.,Janice Dooner Lynch,a homemaker,New York,New York,Sean Morris,a college professor,Whittier,California,...,10800.0,24600.0,27600.0,21500.0,27600.0,13800.0,10700.0,3000.0,27600.0,Tied
4928,"Show #4928 - Wednesday, February 1, 2006","(Jimmy: We're visiting one of the longest-running musicals in Broadway history.) (Cheryl: Come along and be our guest, ne...",Joanna Stromberg,an attorney,Bethesda,Maryland,Dave Halliday,a travel marketer,Williamsburg,Virginia,...,13200.0,13200.0,11200.0,0.0,11200.0,5600.0,13200.0,2000.0,11200.0,Tied


It looks like, before there were tiebreakers, contestants who tied for first place all won.
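If it is useful to know how many contestants shared the top score in a tied game, one possible sketch is to compare each final score with the row maximum and count the matches. The toy frame below reuses the final scores of shows 4150 and 3966 from the tables above; the `n_at_top` name is just for illustration.

```python
import pandas as pd

# Final scores for a tied game (4150) and a game with a single winner (3966).
toy = pd.DataFrame(
    {
        "c_1_score_F": [18000.0, 7700.0],
        "c_2_score_F": [18000.0, 7600.0],
        "rc_score_F": [10801.0, 0.0],
    },
    index=pd.Index(["4150", "3966"], name="show_number"),
)
final_cols = ["c_1_score_F", "c_2_score_F", "rc_score_F"]

# Compare every final score with its row's maximum, then count the matches.
winning_score = toy[final_cols].max(axis=1)
n_at_top = toy[final_cols].eq(winning_score, axis=0).sum(axis=1)
print(n_at_top.tolist())
```

A count above 1 marks a tie for first place.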

All that's left is to write the dataframe to a CSV for further use.

In [23]:
df.to_csv(r'C:\Users\mhous\scrap\JeopardyProject\games.csv', index=True)
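Because the index is written as the first column, it has to be named again when the file is read back. The sketch below round-trips a toy frame through an in-memory buffer instead of the real file, so the data and the buffer are stand-ins. Note that `read_csv` parses the show numbers as integers, even though they were strings before writing:

```python
import io

import pandas as pd

# Toy frame with a string index named 'show_number', like the cleaned dataframe.
toy = pd.DataFrame(
    {"winning_score": [7700.0, 20800.0]},
    index=pd.Index(["3966", "3967"], name="show_number"),
)

# Write to an in-memory buffer (a stand-in for games.csv), then read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=True)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="show_number")
print(restored.index.tolist())
```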