# Intro

# Importing Data
# start with importing messy data set but then use metal stuff for the rest of the teaching example

In [6]:
import requests

# URL locations of data
master_death_metal_bands = "https://raw.githubusercontent.com/orioncohen/metal-bands-by-nation/main/bands.csv"
master_metal_bands = "https://raw.githubusercontent.com/orioncohen/metal-bands-by-nation/main/metal_bands_2017.csv"
master_world_pop = "https://raw.githubusercontent.com/orioncohen/metal-bands-by-nation/main/world_population_1960_2015.csv"

# Grab the death metal bands data
req = requests.get(master_death_metal_bands)
death_metal_bands_data = req.text

# Grab the metal bands data
req = requests.get(master_metal_bands)
metal_bands_data = req.text

# Grab the world population data
req = requests.get(master_world_pop)
world_pop_data = req.text

In [7]:
import pandas as pd

## A Realistic Depiction of Getting Data into Python

Exciting! We have some fresh new cyclic voltammetry data to analyze. Fortuitously, `pandas` has a function called `read_csv` designed for loading tabular data. Let's do it!

In [8]:
pd.read_csv("https://raw.githubusercontent.com/orioncohen/metal-bands-by-nation/main/cyclic_voltammetry_output.txt")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb2 in position 1135: invalid start byte

Ahhh! Our loading failed terribly! Let's take a look at our file to see what might be amiss.

It looks like our data doesn't actually start until line 81, as indicated by "Nb header lines: 81" on the second line. May have been wise to look at our file first, but eh, lesson learned.

Ok, now we need to turn to the [pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) for help. Like a good programmer, I'll google it to find the keyword arguments that we can use to modify `read_csv`.

... google "how to skip lines in pd read csv" ...

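Aha! The keyword `skiprows` appears to be what we are looking for. Its description states "Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file." Our data starts on line 81, so there are 80 lines before it to skip. Since we are passing a single integer, `skiprows` is a count of lines rather than a 0-indexed line number, so we want `skiprows=80`. While we're at it, the `UnicodeDecodeError` above means the file isn't UTF-8 encoded, so we'll also pass `encoding='mac_roman'`. Let's give it a try!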

In [None]:
pd.read_csv("https://raw.githubusercontent.com/orioncohen/metal-bands-by-nation/main/cyclic_voltammetry_output.txt", skiprows=80, encoding='mac_roman')

The columns aren't separated and we have `\t` characters all over the place, but still, progress! The `\t` characters are the separators in our data file, meaning our file is tab-separated. Even though `csv` stands for Comma-Separated Values, other separator characters are also common.

... google "how to specify separator in pd" ...

Looks like we can specify the type of separator by including the `sep` keyword. Let's do it!

In [None]:
pd.read_csv("https://raw.githubusercontent.com/orioncohen/metal-bands-by-nation/main/cyclic_voltammetry_output.txt", skiprows=80, sep='\t', encoding='mac_roman')

Yay! We've successfully imported our DataFrame. Sometimes it just takes a little tinkering. We are going to move on to a nicer dataset for the rest of the workshop but hopefully this has given you a realistic view of how to troubleshoot your imports!
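
To try the same three fixes without downloading anything, here is a small self-contained sketch. The file contents are invented, and the toy bytes are written as `latin-1` (not `mac_roman`, as in our real file), so that is the encoding passed back to `read_csv`.

```python
import io

import pandas as pd

# Invented stand-in for an instrument export: three metadata lines,
# then a tab-separated table whose first row holds the column names.
raw_text = (
    "Toy instrument file\n"
    "Nb header lines : 4\n"
    "Electrode area : 1 cm\u00b2\n"
    "time/s\tEwe/V\tI/mA\n"
    "0.0\t0.10\t1.5\n"
    "0.5\t0.20\t1.7\n"
)
raw_bytes = raw_text.encode("latin-1")  # the superscript 2 becomes byte 0xb2

# Decoding as UTF-8 fails on that byte, much like the error we hit earlier
try:
    raw_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

toy_df = pd.read_csv(
    io.BytesIO(raw_bytes),
    skiprows=3,          # skip the three metadata lines before the column names
    sep="\t",            # columns are separated by tabs, not commas
    encoding="latin-1",  # must match how the bytes were written
)

print(utf8_ok)
print(toy_df)
```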

In the next cell, we will import our data directly from a file hosted on GitHub. This is no harder than loading a `.csv` file on our local computer. We'll use this sick death metal data going forward.

In [None]:
# Make a data frame
metal_bands_df = pd.read_csv("https://raw.githubusercontent.com/orioncohen/metal-bands-by-nation/main/metal_bands_2017.csv")
world_pop_df = pd.read_csv("https://raw.githubusercontent.com/orioncohen/metal-bands-by-nation/main/world_population_1960_2015.csv")
bands_df = pd.read_csv("https://raw.githubusercontent.com/orioncohen/metal-bands-by-nation/main/bands.csv")

# Creating Columns

Let's take a look at our data to see what we're working with! We can look at just the first few rows with the `head()` method.

In [9]:
bands_df.head()

Unnamed: 0,id,name,country,status,formed_in,genre,theme,active
0,1,('M') Inc.,United States,Unknown,2009.0,Death Metal,,2009-?
1,2,(sic),United States,Split-up,1993.0,Death Metal,,1993-1996
2,3,.F.O.A.D.,France,Active,2009.0,Death Metal,Life and Death,2009-present
3,4,100 Suns,United States,Active,2004.0,Death Metal,,2004-present
4,5,12 Days of Anarchy,United States,Split-up,1998.0,Death Metal,Anarchy,1998-2002


In [10]:
bands_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37723 entries, 0 to 37722
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   id         37723 non-null  int64  
 1   name       37723 non-null  object 
 2   country    37723 non-null  object 
 3   status     37723 non-null  object 
 4   formed_in  33392 non-null  float64
 5   genre      37723 non-null  object 
 6   theme      20179 non-null  object 
 7   active     33796 non-null  object 
dtypes: float64(1), int64(1), object(6)
memory usage: 2.3+ MB


Looks like we know a bunch of information about each metal band! We have their `'name'`, their `'country'`, their `'genre'`...even the years that they were `'active'`! These are the columns of this data frame. Note that the `'id'` is different from the row number: `'id'` is a column in the data frame, so if we sorted the data differently, those would be reordered. 
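
A quick sketch of that difference, using three invented rows: after sorting, each `id` stays attached to its band, while the position of the row changes. In pandas the index labels also travel with the rows until we call `reset_index`.

```python
import pandas as pd

# Three made-up bands, for illustration only
toy = pd.DataFrame({"id": [1, 2, 3], "name": ["Zeta", "Alpha", "Mu"]})

# Sorting by name reorders the rows; the 'id' values move with them
by_name = toy.sort_values("name")
print(by_name)

# reset_index(drop=True) renumbers the rows 0, 1, 2 but leaves 'id' alone
renumbered = by_name.reset_index(drop=True)
print(renumbered)
```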

Now let's check out our other data!

In [11]:
world_pop_df.head()

Unnamed: 0.1,Unnamed: 0,Country Name,1960,1961,1962,1963,1964,1965,1966,1967,...,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
0,0,Aruba,54208.0,55435.0,56226.0,56697.0,57029.0,57360.0,57712.0,58049.0,...,100830.0,101218.0,101342.0,101416.0,101597.0,101936.0,102393.0,102921.0,103441.0,103889.0
1,1,Andorra,13414.0,14376.0,15376.0,16410.0,17470.0,18551.0,19646.0,20755.0,...,83373.0,84878.0,85616.0,85474.0,84419.0,82326.0,79316.0,75902.0,72786.0,70473.0
2,2,Afghanistan,8994793.0,9164945.0,9343772.0,9531555.0,9728645.0,9935358.0,10148841.0,10368600.0,...,25183615.0,25877544.0,26528741.0,27207291.0,27962207.0,28809167.0,29726803.0,30682500.0,31627506.0,32526562.0
3,3,Angola,5270844.0,5367287.0,5465905.0,5565808.0,5665701.0,5765025.0,5863568.0,5962831.0,...,18541467.0,19183907.0,19842251.0,20520103.0,21219954.0,21942296.0,22685632.0,23448202.0,24227524.0,25021974.0
4,4,Albania,1608800.0,1659800.0,1711319.0,1762621.0,1814135.0,1864791.0,1914573.0,1965598.0,...,2992547.0,2970017.0,2947314.0,2927519.0,2913021.0,2904780.0,2900247.0,2896652.0,2893654.0,2889167.0


Here, the `'Unnamed: 0'` column is 0-indexed instead of 1-indexed...this is why it's helpful to take a peek at the dataframe itself!

Ok, looks like the world population data is something we could use along with the bands data. Let's see if we can make a column called `'country_population'` in the bands dataframe that has the population of the country for that band. 

There are a couple different ways to add a column. If we had the data for the column as a list, we could do it like this:

`bands_df["country_population"] = [283464, 1283389, ...]`

Or if we wanted to put this information at a particular spot in the dataframe, we could use the `insert()` method:

`bands_df.insert(3, "country_population", [452342, 15425324, ...])`

However, our best option will be the `assign()` method, because it lets us specify how to fill the column:

`bands_df = bands_df.assign(country_population=np.random.randint(10))`

Except we need to figure out how to fetch the actual population number, rather than filling in a random number, of course!
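
Here are the three approaches side by side on a two-row toy frame. The band names and `formed_in` years are invented. The populations are the 2015 values for France and Sweden from the world population data.

```python
import pandas as pd

toy_bands = pd.DataFrame({"name": ["A", "B"], "country": ["France", "Sweden"]})

# 1. Direct assignment appends the column at the end (modifies in place)
toy_bands["country_population"] = [66808385, 9798871]

# 2. insert() places the column at a chosen position (also in place)
toy_bands.insert(1, "formed_in", [2009, 1998])

# 3. assign() returns a new DataFrame; the value can be computed
#    from columns that already exist
toy_bands = toy_bands.assign(
    population_millions=lambda df: df["country_population"] / 1e6
)

print(toy_bands)
```

Note that only `assign()` leaves the original frame untouched, which is why its result is stored back into `toy_bands`.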

So, how do we get a particular element from our `world_pop_df`?

In [12]:
bands_df['country'].value_counts()

United States    7899
Germany          3318
Italy            1625
Brazil           1592
Sweden           1578
                 ... 
Mozambique          1
Curacao             1
Guyana              1
Kenya               1
Tajikistan          1
Name: country, Length: 135, dtype: int64

In [13]:
world_pop_df['Country Name']

0              Aruba
1            Andorra
2        Afghanistan
3             Angola
4            Albania
           ...      
264           Taiwan
265         Guernsey
266          Reunion
267    Åland Islands
268           Jersey
Name: Country Name, Length: 269, dtype: object

# Cleaning the Data

Did you notice that when we looked at the first few rows of `bands_df` there were `NaN` values in the `'theme'` column? Let's say we want to run an analysis based on these themes, so the `NaN` values are a bit of a nuisance. Let's remove the rows that contain them. Note that `dropna()` with no arguments drops every row that has a `NaN` in any column, not just in `'theme'`.



In [14]:
bands_df.dropna(inplace=True)
bands_df.head()

Unnamed: 0,id,name,country,status,formed_in,genre,theme,active
2,3,.F.O.A.D.,France,Active,2009.0,Death Metal,Life and Death,2009-present
4,5,12 Days of Anarchy,United States,Split-up,1998.0,Death Metal,Anarchy,1998-2002
5,6,13th Cadaver,United States,Changed name,2006.0,Death Metal,Death| Gore| Undead,2006-?| ?-2007 (as Splatter the Cadaver)| 2008...
6,7,1917,Argentina,Active,1994.0,Death Metal,Dark Philosophical Poetry| Art| Religion| Psyc...,1994-present
7,8,5th Column,United States,Active,2003.0,Death Metal,War| Death| Battles| Rape,2003-present
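
To drop rows based on one column only, `dropna` accepts a `subset` argument. A sketch on three invented rows, comparing it with the no-argument call:

```python
import numpy as np
import pandas as pd

# Made-up rows: A has no theme, B has no formation year, C is complete
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "formed_in": [2009.0, np.nan, 1998.0],
    "theme": [np.nan, "War", "Anarchy"],
})

any_missing = toy.dropna()                 # drops A and B
theme_only = toy.dropna(subset=["theme"])  # drops only A

print(any_missing["name"].tolist())
print(theme_only["name"].tolist())
```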


Also notice that some entries in the `'active'` column contain `|`, `?`, or `(` characters. Let's drop those rows so that every remaining entry is a simple `start-end` range.

In [15]:
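# Drop rows whose 'active' entry contains '|', '?', or '('.
# '~' inverts the boolean mask (NaNs were already removed by dropna above).
bands_df = bands_df[~bands_df["active"].str.contains(r"\|")]
bands_df = bands_df[~bands_df["active"].str.contains(r"\?")]
bands_df = bands_df[~bands_df["active"].str.contains(r"\(")]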
bands_df.head()

Unnamed: 0,id,name,country,status,formed_in,genre,theme,active
2,3,.F.O.A.D.,France,Active,2009.0,Death Metal,Life and Death,2009-present
4,5,12 Days of Anarchy,United States,Split-up,1998.0,Death Metal,Anarchy,1998-2002
6,7,1917,Argentina,Active,1994.0,Death Metal,Dark Philosophical Poetry| Art| Religion| Psyc...,1994-present
7,8,5th Column,United States,Active,2003.0,Death Metal,War| Death| Battles| Rape,2003-present
9,10,602,Russia,Active,2012.0,Death Metal,Cruelty of regimes| WWII| Death,2012-present
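
The same kind of filter can be written with a single pattern. This sketch uses a regex character class that matches any of the three characters. The `active` strings are invented, modeled on the messy ones above.

```python
import pandas as pd

toy = pd.DataFrame({
    "active": [
        "2009-present",
        "2006-?",
        "1998-2002",
        "2006-2007 (as X)| 2008-present",
    ]
})

# Inside [...] the characters |, ? and ( are matched literally
messy = toy["active"].str.contains(r"[|?(]")
clean = toy[~messy]  # keep only rows that did NOT match

print(clean)
```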


In [16]:
# Treat "present" as 2022, touching only the 'active' column
bands_df = bands_df.assign(active=bands_df["active"].str.replace("present", "2022", regex=False))
bands_df.head()

Unnamed: 0,id,name,country,status,formed_in,genre,theme,active
2,3,.F.O.A.D.,France,Active,2009.0,Death Metal,Life and Death,2009-2022
4,5,12 Days of Anarchy,United States,Split-up,1998.0,Death Metal,Anarchy,1998-2002
6,7,1917,Argentina,Active,1994.0,Death Metal,Dark Philosophical Poetry| Art| Religion| Psyc...,1994-2022
7,8,5th Column,United States,Active,2003.0,Death Metal,War| Death| Battles| Rape,2003-2022
9,10,602,Russia,Active,2012.0,Death Metal,Cruelty of regimes| WWII| Death,2012-2022
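
One thing to watch with `DataFrame.replace(..., regex=True)` is that it rewrites matches in every string column, not only `'active'`. The sketch below uses invented rows (the band name "Omnipresent" is hypothetical) to compare that with a replacement limited to one column.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["Omnipresent", "Toy Band"],
    "active": ["2001-present", "1998-2002"],
})

# Whole-frame replacement: also rewrites the band name
whole = toy.replace(to_replace="present", value="2022", regex=True)

# Column-only replacement: the name is left alone
col_only = toy.assign(
    active=toy["active"].str.replace("present", "2022", regex=False)
)

print(whole)
print(col_only)
```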


In [17]:
# Keep only the first theme listed for each band
bands_df["theme"] = bands_df["theme"].str.split(r"\|").str[0]
bands_df.head()

Unnamed: 0,id,name,country,status,formed_in,genre,theme,active
2,3,.F.O.A.D.,France,Active,2009.0,Death Metal,Life and Death,2009-2022
4,5,12 Days of Anarchy,United States,Split-up,1998.0,Death Metal,Anarchy,1998-2002
6,7,1917,Argentina,Active,1994.0,Death Metal,Dark Philosophical Poetry,1994-2022
7,8,5th Column,United States,Active,2003.0,Death Metal,War,2003-2022
9,10,602,Russia,Active,2012.0,Death Metal,Cruelty of regimes,2012-2022


In [18]:
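# The end year is the part of 'active' after the hyphen
bands_df["ended_in"] = bands_df["active"].str.split(r"\-").str[1]
# Drop any rows that still have missing values
bands_df.dropna(inplace=True)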
bands_df.head()

Unnamed: 0,id,name,country,status,formed_in,genre,theme,active,ended_in
2,3,.F.O.A.D.,France,Active,2009.0,Death Metal,Life and Death,2009-2022,2022
4,5,12 Days of Anarchy,United States,Split-up,1998.0,Death Metal,Anarchy,1998-2002,2002
6,7,1917,Argentina,Active,1994.0,Death Metal,Dark Philosophical Poetry,1994-2022,2022
7,8,5th Column,United States,Active,2003.0,Death Metal,War,2003-2022,2022
9,10,602,Russia,Active,2012.0,Death Metal,Cruelty of regimes,2012-2022,2022


In [19]:
# Band age in years: end year minus formation year
bands_df["age"] = bands_df["ended_in"].astype(float) - bands_df["formed_in"].astype(float)
bands_df.head()

Unnamed: 0,id,name,country,status,formed_in,genre,theme,active,ended_in,age
2,3,.F.O.A.D.,France,Active,2009.0,Death Metal,Life and Death,2009-2022,2022,13.0
4,5,12 Days of Anarchy,United States,Split-up,1998.0,Death Metal,Anarchy,1998-2002,2002,4.0
6,7,1917,Argentina,Active,1994.0,Death Metal,Dark Philosophical Poetry,1994-2022,2022,28.0
7,8,5th Column,United States,Active,2003.0,Death Metal,War,2003-2022,2022,19.0
9,10,602,Russia,Active,2012.0,Death Metal,Cruelty of regimes,2012-2022,2022,10.0
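
The last few cells can also be condensed. This sketch is an alternative, not what the cells above do: it splits `'active'` into two columns with `expand=True` and converts the end year with `pd.to_numeric`. The two rows mirror the first two bands shown above.

```python
import pandas as pd

toy = pd.DataFrame({
    "formed_in": [2009.0, 1998.0],
    "active": ["2009-2022", "1998-2002"],
})

# expand=True returns a DataFrame: column 0 is the start, column 1 the end
parts = toy["active"].str.split("-", expand=True)

toy["ended_in"] = pd.to_numeric(parts[1])        # strings -> numbers
toy["age"] = toy["ended_in"] - toy["formed_in"]  # years between start and end

print(toy)
```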


How do we get the specific countries out of our world population data frame to add population information to our bands dataframe? 

In [20]:
world_pop_df.loc[world_pop_df["Country Name"]=="France"]

Unnamed: 0.1,Unnamed: 0,Country Name,1960,1961,1962,1963,1964,1965,1966,1967,...,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
74,74,France,46814237.0,47444751.0,48119649.0,48803680.0,49449403.0,50023774.0,50508717.0,50915456.0,...,63621376.0,64016229.0,64374990.0,64707044.0,65027512.0,65342776.0,65659790.0,65972097.0,66495940.0,66808385.0


In [21]:
world_pop_df.loc[world_pop_df["Country Name"] == "France", "2015"]

74    66808385.0
Name: 2015, dtype: float64
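
The result above is still a one-element Series rather than a bare number. The sketch below uses a two-row toy table built from 2015 values shown in this notebook. It pulls out the number itself and shows what comes back when a country is absent ("Atlantis" is a made-up name).

```python
import pandas as pd

toy_pop = pd.DataFrame({
    "Country Name": ["France", "Sweden"],
    "2015": [66808385.0, 9798871.0],
})

# Row mask and column label in a single .loc call
match = toy_pop.loc[toy_pop["Country Name"] == "France", "2015"]
france_2015 = match.iloc[0]  # first (and only) element as a plain number

# A country that is not in the table gives an empty Series
missing = toy_pop.loc[toy_pop["Country Name"] == "Atlantis", "2015"]

print(france_2015)
print(missing.empty)
```

Calling `.iloc[0]` or `.values[0]` on an empty Series raises an `IndexError`, so it is worth checking `.empty` first or catching that error.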

In [22]:
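# Count bands per country and turn the counts into a two-column DataFrame
bands_pop = bands_df["country"].value_counts().rename_axis("country").reset_index(name="num_bands")
# Placeholder column; it is filled in with 2015 populations below
bands_pop["population"] = " "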
bands_pop.head()

Unnamed: 0,country,num_bands,population
0,United States,2460,
1,Germany,780,
2,Brazil,728,
3,Italy,499,
4,Sweden,438,


In [24]:
# Look up each country's 2015 population and store it in bands_pop
for ii in range(bands_pop.shape[0]):
	country = bands_pop.at[ii, "country"]
	try:
		my_val = world_pop_df.loc[world_pop_df["Country Name"] == country, "2015"].values[0]
		bands_pop.at[ii, "population"] = my_val
	except IndexError:
		# No matching row in world_pop_df, so .values is empty
		print("%s was weird" % country)
bands_pop.head()

International was weird
Unknown was weird


Unnamed: 0,country,num_bands,population
0,United States,2460,321418820.0
1,Germany,780,81413145.0
2,Brazil,728,207847528.0
3,Italy,499,60802085.0
4,Sweden,438,9798871.0
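
The loop above gets the job done. As an alternative that this notebook does not use, the same lookup can be sketched as a left `merge`, which also reports which countries had no match. The band counts are invented, and the populations are 2015 values from the table above.

```python
import pandas as pd

toy_counts = pd.DataFrame({
    "country": ["France", "Sweden", "International"],
    "num_bands": [3, 2, 1],
})
toy_pop = pd.DataFrame({
    "Country Name": ["France", "Sweden"],
    "2015": [66808385.0, 9798871.0],
})

# Rename so both frames share a 'country' key, then keep every row of toy_counts
merged = toy_counts.merge(
    toy_pop.rename(columns={"Country Name": "country", "2015": "population"}),
    on="country",
    how="left",
    indicator=True,  # adds a '_merge' column describing each row's match
)

# Rows found only on the left side had no population entry
unmatched = merged.loc[merged["_merge"] == "left_only", "country"].tolist()

print(merged)
print(unmatched)
```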


# Summary Statistics

# Exporting Data