## Converting a .csv file into a MySQL database instance

In the following Notebook, we will import a .csv dataset into Pandas, clean & wrangle the relevant data, and finally write that data directly to MySQL!

First, let's set up the relevant imports for Pandas and SQLAlchemy:

In [151]:
import pandas as pd
import sqlalchemy as sql

# Creating our Connection String
connection_string = 'mysql://root:Homestar1!@localhost/school_db'
# Creating our SQL engine using connection_string
sql_engine = sql.create_engine(connection_string)
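
As an aside, hard-coding a password in a notebook is easy to leak. Below is a minimal sketch of one alternative, assuming the credentials live in environment variables. The variable names `MYSQL_USER` and `MYSQL_PASSWORD` and the helper `build_connection_string` are hypothetical, not part of the original notebook. `quote_plus` escapes characters such as `@` or `/` that would otherwise break the URL.

```python
import os
from urllib.parse import quote_plus


def build_connection_string(user, password, host, database):
    """Assemble a MySQL connection URL, escaping special characters."""
    return f"mysql://{quote_plus(user)}:{quote_plus(password)}@{host}/{database}"


# Hypothetical environment variable names; fall back to placeholders if unset
user = os.environ.get("MYSQL_USER", "root")
password = os.environ.get("MYSQL_PASSWORD", "")
safe_connection_string = build_connection_string(user, password, "localhost", "school_db")
```

The resulting string can be passed to `sql.create_engine` in the same way as the literal above.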

Next, let's import our Pokemon .csv file and review its columns, data types, and non-null counts:

In [152]:
pokemon = pd.read_csv('Pokemon.csv')
pokemon.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   #           800 non-null    int64 
 1   Name        800 non-null    object
 2   Type 1      800 non-null    object
 3   Type 2      414 non-null    object
 4   Total       800 non-null    int64 
 5   HP          800 non-null    int64 
 6   Attack      800 non-null    int64 
 7   Defense     800 non-null    int64 
 8   Sp. Atk     800 non-null    int64 
 9   Sp. Def     800 non-null    int64 
 10  Speed       800 non-null    int64 
 11  Generation  800 non-null    int64 
 12  Legendary   800 non-null    bool  
dtypes: bool(1), int64(9), object(3)
memory usage: 75.9+ KB


The good news: this data is already very clean! However, the Type 2 column has only 414 non-null values out of 800, so 386 entries are missing. These most likely represent Pokemon with no secondary type, but for our purposes, it would be nice to replace them with a placeholder value.

First, let's take a look at just that "Type 2" column

In [153]:
pokemon["Type 2"]

0      Poison
1      Poison
2      Poison
3      Poison
4         NaN
        ...  
795     Fairy
796     Fairy
797     Ghost
798      Dark
799     Water
Name: Type 2, Length: 800, dtype: object
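
To count the missing values directly rather than reading them off the summary, `.isna().sum()` can be used. The sketch below runs on a tiny stand-in DataFrame (the `sample` rows are made up for illustration) so it works without the .csv file.

```python
import pandas as pd

# Small stand-in for the Pokemon data; None marks a missing secondary type
sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Squirtle"],
    "Type 2": ["Poison", None, None],
})

# Missing values per column, and for the "Type 2" column alone
missing_per_column = sample.isna().sum()
missing_type2 = int(sample["Type 2"].isna().sum())
print(missing_per_column)
```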

To "fill in" a value in place of NaN, we can utilize .fillna(value)

In [154]:
pokemon.fillna("NONE").head()

# To update our original DataFrame, we add inplace=True to the statement above
pokemon.fillna("NONE", inplace=True)
pokemon


Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,NONE,309,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,700,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True
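
Note that calling `.fillna("NONE")` on the whole DataFrame fills missing values in every column. That is fine here because only Type 2 has any, but a column-targeted version is a safer habit. A minimal sketch on made-up sample rows (the missing `HP` value is added purely to show that other columns are left alone):

```python
import pandas as pd

sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander"],
    "Type 2": ["Poison", None],
    "HP": [45.0, None],  # hypothetical missing number, for illustration only
})

# Fill only the "Type 2" column and assign the result back
sample["Type 2"] = sample["Type 2"].fillna("NONE")
print(sample)
```

Assigning the result back to the column also avoids relying on `inplace=True`.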


Next, it looks like some of the names for different "variations" of Pokemon need some reformatting. First, we'll sort the DataFrame by the length of the Name column in descending order, so we see those extra-long names first

In [155]:
pokemon.sort_values('Name', key=lambda x:x.str.len(), ascending=False, inplace=True)
pokemon.head(20)

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
124,115,KangaskhanMega Kangaskhan,Normal,NONE,590,105,125,100,60,100,100,1,False
7,6,CharizardMega Charizard X,Fire,Dragon,634,78,130,111,130,85,100,1,False
8,6,CharizardMega Charizard Y,Fire,Flying,634,78,104,78,159,115,100,1,False
154,142,AerodactylMega Aerodactyl,Rock,Flying,615,80,135,85,70,95,150,1,False
704,642,ThundurusIncarnate Forme,Electric,Flying,580,79,115,70,125,80,111,5,True
413,376,MetagrossMega Metagross,Steel,Psychic,700,80,145,150,105,110,110,3,False
702,641,TornadusIncarnate Forme,Flying,NONE,580,79,115,70,125,80,111,5,True
716,648,MeloettaPirouette Forme,Normal,Fighting,600,100,128,90,77,77,128,5,False
268,248,TyranitarMega Tyranitar,Rock,Dark,700,100,164,150,95,120,71,2,False
511,460,AbomasnowMega Abomasnow,Grass,Ice,594,90,132,105,132,105,30,4,False
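
The `key` argument of `sort_values` receives the whole column as a Series and must return a Series of the same length to sort by. Here is a small sketch of the same idea on a few sample names, returning a new DataFrame instead of sorting in place:

```python
import pandas as pd

names = pd.DataFrame({"Name": ["Mew", "CharizardMega Charizard X", "Onix", "Bulbasaur"]})

# Sort by the length of each name, longest first; the original is left untouched
by_length = names.sort_values("Name", key=lambda s: s.str.len(), ascending=False)
print(by_length)
```

The rows keep their original index labels after sorting, which is why the output above starts at 124 rather than 0.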


First, we'll start with "Mega".
The pattern appears to be "NameMega Name", and we want just "Mega Name". To do this, we will split the value on the string "Mega " and prepend "Mega " to the part that follows it.
Note that .str.contains treats its pattern as a regular expression by default.

In [156]:
# First, create a mask of True/False values indicating whether the name contains "Mega "
mask = pokemon['Name'].str.contains("Mega ")
# Next, use .loc to target the rows where the mask is True.
# We replace the value in the "Name" column with "Mega " + the original
# value split on "Mega ", keeping index 1 (the Pokemon name after "Mega ")
pokemon.loc[mask, "Name"] = "Mega " + pokemon.loc[mask,'Name'].str.split("Mega ").str[1]
# Finally, print the pokemon DataFrame again to confirm the update was successful
pokemon

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
124,115,Mega Kangaskhan,Normal,NONE,590,105,125,100,60,100,100,1,False
7,6,Mega Charizard X,Fire,Dragon,634,78,130,111,130,85,100,1,False
8,6,Mega Charizard Y,Fire,Flying,634,78,104,78,159,115,100,1,False
154,142,Mega Aerodactyl,Rock,Flying,615,80,135,85,70,95,150,1,False
704,642,ThundurusIncarnate Forme,Electric,Flying,580,79,115,70,125,80,111,5,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...
103,95,Onix,Rock,Ground,385,35,45,160,30,45,70,1,False
537,480,Uxie,Psychic,NONE,580,75,75,130,75,130,95,4,True
133,124,Jynx,Ice,Psychic,455,65,50,35,115,95,95,1,False
165,151,Mew,Psychic,NONE,600,100,100,100,100,100,100,1,False
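
The same clean-up can also be written as a single regex replacement. This is an alternative sketch, not the approach used in the cell above: the pattern captures everything from "Mega " onward and drops the duplicated name in front of it. Names without "Mega " (note the trailing space) are left unchanged.

```python
import pandas as pd

names = pd.Series([
    "CharizardMega Charizard X",
    "KangaskhanMega Kangaskhan",
    "Meganium",   # contains "Mega" but not "Mega ", so it should not change
    "Bulbasaur",
])

# Keep only the captured group: "Mega " plus whatever follows it
cleaned = names.str.replace(r"^.+?(Mega .+)$", r"\1", regex=True)
print(cleaned.tolist())
```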


Perfect!  Let's repeat the process and see what other names we can clean up

First, let's sort by Name length again:

In [157]:
pokemon.sort_values('Name', key=lambda x:x.str.len(), ascending=False, inplace=True)
pokemon.head(20)

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
704,642,ThundurusIncarnate Forme,Electric,Flying,580,79,115,70,125,80,111,5,True
708,645,LandorusIncarnate Forme,Ground,Flying,600,89,125,90,115,80,101,5,True
702,641,TornadusIncarnate Forme,Flying,NONE,580,79,115,70,125,80,111,5,True
615,555,DarmanitanStandard Mode,Fire,NONE,480,105,140,55,30,55,95,5,False
716,648,MeloettaPirouette Forme,Normal,Fighting,600,100,128,90,77,77,128,5,False
705,642,ThundurusTherian Forme,Electric,Flying,580,79,105,70,145,80,101,5,True
784,711,GourgeistAverage Size,Ghost,Grass,494,65,90,122,58,75,84,6,False
424,383,GroudonPrimal Groudon,Ground,Fire,770,100,180,160,150,90,90,3,True
544,487,GiratinaAltered Forme,Ghost,Dragon,680,150,100,120,100,120,90,4,True
709,645,LandorusTherian Forme,Ground,Flying,600,89,145,90,105,80,91,5,True


I see several instances of "Incarnate" and "Therian".  We'll repeat the process with both of these.  Desired output will be FORM + NAME

In [158]:
# First, create a mask of True/False values indicating whether the name contains "Incarnate"
incarnate = pokemon['Name'].str.contains("Incarnate")
# Next, use .loc to target the rows where the mask is True.
# We replace the value in the "Name" column with "Incarnate " + the original
# value split on "Incarnate ", keeping index 0 (the Pokemon name)
pokemon.loc[incarnate, "Name"] = "Incarnate " + pokemon.loc[incarnate,"Name"].str.split("Incarnate ").str[0]
# Finally, re-sort by name length and print the first rows to confirm the update was successful
pokemon.sort_values('Name', key=lambda x:x.str.len(), ascending=False, inplace=True)
pokemon.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
615,555,DarmanitanStandard Mode,Fire,NONE,480,105,140,55,30,55,95,5,False
716,648,MeloettaPirouette Forme,Normal,Fighting,600,100,128,90,77,77,128,5,False
705,642,ThundurusTherian Forme,Electric,Flying,580,79,105,70,145,80,101,5,True
780,710,PumpkabooAverage Size,Ghost,Grass,335,49,66,70,44,55,51,6,False
784,711,GourgeistAverage Size,Ghost,Grass,494,65,90,122,58,75,84,6,False


In [159]:
# First, create a mask of True/False values indicating whether the name contains "Therian"
therian = pokemon['Name'].str.contains("Therian")
# Next, use .loc to target the rows where the mask is True.
# We replace the value in the "Name" column with "Therian " + the original
# value split on "Therian", keeping index 0 (the Pokemon name)
pokemon.loc[therian, "Name"] = "Therian " + pokemon.loc[therian,'Name'].str.split("Therian").str[0]
# Finally, print the pokemon DataFrame again to confirm the update was successful
pokemon


Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
615,555,DarmanitanStandard Mode,Fire,NONE,480,105,140,55,30,55,95,5,False
716,648,MeloettaPirouette Forme,Normal,Fighting,600,100,128,90,77,77,128,5,False
705,642,Therian Thundurus,Electric,Flying,580,79,105,70,145,80,101,5,True
780,710,PumpkabooAverage Size,Ghost,Grass,335,49,66,70,44,55,51,6,False
784,711,GourgeistAverage Size,Ghost,Grass,494,65,90,122,58,75,84,6,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
103,95,Onix,Rock,Ground,385,35,45,160,30,45,70,1,False
537,480,Uxie,Psychic,NONE,580,75,75,130,75,130,95,4,True
93,86,Seel,Water,NONE,325,65,45,55,45,70,45,1,False
165,151,Mew,Psychic,NONE,600,100,100,100,100,100,100,1,False


For the sake of time, we'll stop here, but we could repeat the process for the remaining "alternate form" Pokemon.
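
Since the "Incarnate" and "Therian" cells differ only in the form name, the repeated steps could be wrapped in a small helper. The function `move_form_to_front` below is a hypothetical sketch, not part of the original notebook, and it assumes every affected name follows the "NameForm Suffix" pattern seen above.

```python
import pandas as pd


def move_form_to_front(names: pd.Series, form: str) -> pd.Series:
    """Rewrite 'NameForm Suffix' as 'Form Name' for names containing `form`."""
    result = names.copy()
    # regex=False treats the form as a literal string rather than a pattern
    mask = names.str.contains(form, regex=False)
    # Index 0 of the split is the Pokemon name that precedes the form
    result.loc[mask] = form + " " + names.loc[mask].str.split(form).str[0]
    return result


names = pd.Series(["ThundurusIncarnate Forme", "LandorusTherian Forme", "Pikachu"])
for form in ["Incarnate", "Therian"]:
    names = move_form_to_front(names, form)
print(names.tolist())
```

Forms that do not fit this pattern, such as "GroudonPrimal Groudon", would still need their own rule.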

Next, we'll write this DataFrame to a new SQL table

In [160]:
pokemon.to_sql("pokemon_database", sql_engine, if_exists='fail')

800
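
Two details of `to_sql` are worth knowing here: by default the DataFrame index is written as an extra column, and `if_exists='fail'` raises a `ValueError` if the cell is run a second time. The sketch below demonstrates both on made-up sample rows, using an in-memory SQLite connection as a stand-in for the MySQL engine so that it runs without a database server.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database, used here only as a stand-in for MySQL
conn = sqlite3.connect(":memory:")
sample = pd.DataFrame({"Name": ["Bulbasaur", "Charmander"], "Type 2": ["Poison", "NONE"]})

# index=False skips writing the DataFrame index as its own column
sample.to_sql("pokemon_database", conn, index=False, if_exists="fail")

# A second write with if_exists="fail" raises ValueError because the table exists
try:
    sample.to_sql("pokemon_database", conn, index=False, if_exists="fail")
    second_write_failed = False
except ValueError:
    second_write_failed = True

# if_exists="replace" drops and recreates the table instead
sample.to_sql("pokemon_database", conn, index=False, if_exists="replace")
row_count = conn.execute("SELECT COUNT(*) FROM pokemon_database").fetchone()[0]
print(second_write_failed, row_count)
```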

Finally, let's verify our data was saved successfully by querying our MySQL database for our newly created table

In [None]:
# Our raw SQL query; in this example it pulls all records from the table pokemon_database
query = "SELECT * FROM pokemon_database"
# Combining our query and engine together, and bringing the resulting data into a Pandas DataFrame
dataframe = pd.read_sql_query(query,sql_engine)
dataframe
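
Beyond eyeballing the result, the read-back data can be compared against the original programmatically. Here is a minimal sketch of that check, again on made-up sample rows with an in-memory SQLite connection standing in for the MySQL engine:

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL engine
sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Mew"],
    "Total": [318, 309, 600],
})
sample.to_sql("pokemon_database", conn, index=False)

# Read the table back and compare its shape and contents with the original
round_trip = pd.read_sql_query("SELECT * FROM pokemon_database", conn)
same_shape = round_trip.shape == sample.shape
same_names = round_trip["Name"].tolist() == sample["Name"].tolist()
same_totals = round_trip["Total"].tolist() == sample["Total"].tolist()
print(same_shape, same_names, same_totals)
```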