# Intro

# Importing Data
# start with importing messy data set but then use metal stuff for the rest of the teaching example

In [1]:
import requests

# URL locations of data
master_death_metal_bands = "https://raw.githubusercontent.com/orioncohen/metal-bands-by-nation/main/bands.csv"
master_metal_bands = "https://raw.githubusercontent.com/orioncohen/metal-bands-by-nation/main/metal_bands_2017.csv"
master_world_pop = "https://raw.githubusercontent.com/orioncohen/metal-bands-by-nation/main/world_population_1960_2015.csv"

# Grab the death metal bands data
req = requests.get(master_death_metal_bands)
death_metal_bands_data = req.text

# Grab the metal bands data
req = requests.get(master_metal_bands)
metal_bands_data = req.text

# Grab the world population data
req = requests.get(master_world_pop)
world_pop_data = req.text
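
The text returned by `requests` is just a string, so it still has to be parsed before it becomes a table. One way to do that without saving a file first is to wrap the string in `io.StringIO`. The snippet below is a minimal sketch: `csv_text` is a made-up stand-in for a downloaded string such as `metal_bands_data`, and `df_from_text` is a name chosen for illustration.

```python
import io
import pandas as pd

# Stand-in for the string returned by requests.get(...).text (two sample rows)
csv_text = "band_name,fans,origin\nIron Maiden,4195,United Kingdom\nOpeth,4147,Sweden\n"

# StringIO makes the string behave like an open file, which read_csv accepts
df_from_text = pd.read_csv(io.StringIO(csv_text))
print(df_from_text)
```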

In [2]:
import pandas as pd

## A Realistic Depiction of Getting Data into Python

Exciting! We have some fresh new cyclic voltammetry data to analyze. Fortuitously, `pandas` has a function called `read_csv` designed for loading tabular data. Let's do it!

In [3]:
pd.read_csv("cyclic_voltammetry_output.txt")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb2 in position 1135: invalid start byte

Ahhh! Our loading failed terribly! Let's take a look at our file to see what might be amiss.

It looks like our table doesn't actually start until line 81 (the row of column names), as indicated by "Nb header lines: 81" on the second line. It may have been wise to look at our file first, but eh, lesson learned.

Ok, now we need to turn to the [pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) for help. Like a good programmer, I'll google it to find the keyword arguments that we can use to modify `read_csv`.

... google "how to skip lines in pd read csv" ...

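Aha! The keyword `skiprows` appears to be what we are looking for. Its description states "Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file." Passing an integer skips that many lines from the top of the file. There are 81 header lines, but the last of them holds the column names, which we want to keep, so we will want `skiprows` to have a value of 80. The `UnicodeDecodeError` also told us that the file is not valid UTF-8, so we'll pass `encoding='mac_roman'` as well. Let's give it a try!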

In [4]:
pd.read_csv("cyclic_voltammetry_output.txt", skiprows=80, encoding='mac_roman')

Unnamed: 0,mode\tox/red\terror\tcontrol changes\tNs changes\tcounter inc.\tNs\ttime/s\tcontrol/V/mA\tEwe/V\tdq/mA.h\tEce/V\tP/W\t<I>/mA\tEwe-Ece/V\tx\t(Q-Qo)/mA.h\tCapacity/mA.h
0,3\t1\t0\t0\t0\t0\t0\t0.0002\t0\t3.13411832\t0\...
1,3\t1\t0\t0\t0\t0\t0\t60.0002\t0\t3.13436651\t0...
2,3\t1\t0\t0\t0\t0\t0\t120.0002\t0\t3.13472915\t...
3,3\t1\t0\t0\t0\t0\t0\t180.0002\t0\t3.13482451\t...
4,3\t1\t0\t0\t0\t0\t0\t240.0002\t0\t3.13497734\t...
...,...
23215,1\t0\t0\t0\t0\t0\t1\t417662.4821\t-0.01425\t2....
23216,1\t0\t0\t0\t0\t0\t1\t417699.5241\t-0.01425\t2....
23217,1\t0\t0\t0\t0\t0\t1\t417733.6881\t-0.01425\t2....
23218,1\t0\t0\t0\t0\t0\t1\t417733.6951\t-0.01425\t2....


The columns aren't separated and we have `\t` characters all over the place, but still, progress! The `\t` characters are the separators in our data file, meaning our file is tab-separated. Even though `csv` stands for Comma-Separated Values, other separator characters are also common.

... google "how to specify separator in pd" ...

Looks like we can specify the type of separator by including the `sep` keyword. Let's do it!

In [5]:
pd.read_csv("cyclic_voltammetry_output.txt", skiprows=80, sep='\t', encoding='mac_roman')

Unnamed: 0,mode,ox/red,error,control changes,Ns changes,counter inc.,Ns,time/s,control/V/mA,Ewe/V,dq/mA.h,Ece/V,P/W,<I>/mA,Ewe-Ece/V,x,(Q-Qo)/mA.h,Capacity/mA.h
0,3,1,0,0,0,0,0,0.0002,0.00000,3.134118,0.000000e+00,-0.002699,0.000000,0.000000,3.136817,0.000000,0.000000,0.000000
1,3,1,0,0,0,0,0,60.0002,0.00000,3.134367,0.000000e+00,-0.002604,0.000000,0.000000,3.136970,0.000000,0.000000,0.000000
2,3,1,0,0,0,0,0,120.0002,0.00000,3.134729,0.000000e+00,-0.002527,0.000000,0.000000,3.137256,0.000000,0.000000,0.000000
3,3,1,0,0,0,0,0,180.0002,0.00000,3.134825,0.000000e+00,-0.002527,0.000000,0.000000,3.137352,0.000000,0.000000,0.000000
4,3,1,0,0,0,0,0,240.0002,0.00000,3.134977,0.000000e+00,-0.002355,0.000000,0.000000,3.137333,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23215,1,0,0,0,0,0,1,417662.4821,-0.01425,2.021435,-1.447666e-04,0.016924,0.000029,-0.014246,2.004511,1.917172,-0.328737,0.135839
23216,1,0,0,0,0,0,1,417699.5241,-0.01425,2.020919,-1.465844e-04,0.016771,0.000029,-0.014246,2.004148,1.918027,-0.328884,0.135986
23217,1,0,0,0,0,0,1,417733.6881,-0.01425,2.020404,-1.351955e-04,0.016962,0.000029,-0.014246,2.003442,1.918815,-0.329019,0.136121
23218,1,0,0,0,0,0,1,417733.6951,-0.01425,2.020938,-2.770032e-08,0.016790,0.000029,-0.014246,2.004148,1.918815,-0.329019,0.136121


Yay! We've successfully imported our DataFrame. Sometimes it just takes a little tinkering. We are going to move on to a nicer dataset for the rest of the workshop but hopefully this has given you a realistic view of how to troubleshoot your imports!
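
The troubleshooting steps above (encoding, skipped header lines, tab separator) can be bundled into one small helper. This is only a sketch under assumptions: `read_instrument_export` is a hypothetical name, the regular expression assumes the file declares its header count in the form "Nb header lines: 81", and the sample bytes are made up to keep the example self-contained.

```python
import io
import re
import pandas as pd

def read_instrument_export(raw_bytes, encoding="mac_roman"):
    """Hypothetical helper: parse a tab-separated export that declares its header length."""
    text = raw_bytes.decode(encoding)
    # Find the declared number of header lines, e.g. "Nb header lines: 81"
    match = re.search(r"Nb header lines\s*:\s*(\d+)", text)
    if match is None:
        raise ValueError("no 'Nb header lines' entry found")
    n_header = int(match.group(1))
    # The last header line holds the column names, so skip one line fewer
    return pd.read_csv(io.StringIO(text), skiprows=n_header - 1, sep="\t")

# Tiny made-up file: 4 header lines, one of which contains a non-UTF-8 byte (0xb2)
raw = (b"ASCII FILE\nNb header lines: 4\nnote: \xb2\n"
       b"time/s\tEwe/V\n0.0002\t3.134\n60.0002\t3.135\n")
cv_df = read_instrument_export(raw)
print(cv_df)
```

For a real file, the bytes could come from `open("cyclic_voltammetry_output.txt", "rb").read()`.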

In the next cell, we will import our data directly from a file hosted on GitHub. This is no harder than loading a `.csv` file on our local computer. We'll use this sick death metal data going forward.

In [6]:
# Load each dataset into a DataFrame (the first from GitHub, the others from local copies)
metal_bands_df = pd.read_csv("https://raw.githubusercontent.com/orioncohen/metal-bands-by-nation/main/metal_bands_2017.csv")
world_pop_df = pd.read_csv("world_population_1960_2015.csv")
bands_df = pd.read_csv("bands.csv")

# Creating Columns

Let's take a look at our data to see what we're working with! We can look at just the first few rows with the `head()` method.

In [7]:
metal_bands_df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,band_name,fans,formed,origin,split,style
0,0,0,Iron Maiden,4195,1975,United Kingdom,-,"New wave of british heavy,Heavy"
1,1,1,Opeth,4147,1990,Sweden,1990,"Extreme progressive,Progressive rock,Progressive"
2,2,2,Metallica,3712,1981,USA,-,"Heavy,Bay area thrash"
3,3,3,Megadeth,3105,1983,USA,1983,"Thrash,Heavy,Hard rock"
4,4,4,Amon Amarth,3054,1988,Sweden,-,Melodic death


In [8]:
metal_bands_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Unnamed: 0    5000 non-null   int64 
 1   Unnamed: 0.1  5000 non-null   int64 
 2   band_name     5000 non-null   object
 3   fans          5000 non-null   int64 
 4   formed        5000 non-null   object
 5   origin        4992 non-null   object
 6   split         5000 non-null   object
 7   style         5000 non-null   object
dtypes: int64(3), object(5)
memory usage: 312.6+ KB


Looks like we know a bunch of information about each metal band! We have their `'band_name'`, their `'origin'`, their `'style'`...even the years that they `'formed'` and `'split'`! These are the columns of this data frame. Note that `'Unnamed: 0'` is different from the row number: it is a column in the data frame, so if we sorted the data differently, its values would be reordered along with the rows.

Now let's check out our other data!

In [9]:
world_pop_df.head()

Unnamed: 0.1,Unnamed: 0,Country Name,1960,1961,1962,1963,1964,1965,1966,1967,...,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
0,0,Aruba,54208.0,55435.0,56226.0,56697.0,57029.0,57360.0,57712.0,58049.0,...,100830.0,101218.0,101342.0,101416.0,101597.0,101936.0,102393.0,102921.0,103441.0,103889.0
1,1,Andorra,13414.0,14376.0,15376.0,16410.0,17470.0,18551.0,19646.0,20755.0,...,83373.0,84878.0,85616.0,85474.0,84419.0,82326.0,79316.0,75902.0,72786.0,70473.0
2,2,Afghanistan,8994793.0,9164945.0,9343772.0,9531555.0,9728645.0,9935358.0,10148841.0,10368600.0,...,25183615.0,25877544.0,26528741.0,27207291.0,27962207.0,28809167.0,29726803.0,30682500.0,31627506.0,32526562.0
3,3,Angola,5270844.0,5367287.0,5465905.0,5565808.0,5665701.0,5765025.0,5863568.0,5962831.0,...,18541467.0,19183907.0,19842251.0,20520103.0,21219954.0,21942296.0,22685632.0,23448202.0,24227524.0,25021974.0
4,4,Albania,1608800.0,1659800.0,1711319.0,1762621.0,1814135.0,1864791.0,1914573.0,1965598.0,...,2992547.0,2970017.0,2947314.0,2927519.0,2913021.0,2904780.0,2900247.0,2896652.0,2893654.0,2889167.0


In [10]:
world_pop_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 269 entries, 0 to 268
Data columns (total 58 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0    269 non-null    int64  
 1   Country Name  269 non-null    object 
 2   1960          264 non-null    float64
 3   1961          260 non-null    float64
 4   1962          260 non-null    float64
 5   1963          260 non-null    float64
 6   1964          260 non-null    float64
 7   1965          262 non-null    float64
 8   1966          260 non-null    float64
 9   1967          260 non-null    float64
 10  1968          260 non-null    float64
 11  1969          260 non-null    float64
 12  1970          264 non-null    float64
 13  1971          260 non-null    float64
 14  1972          260 non-null    float64
 15  1973          260 non-null    float64
 16  1974          260 non-null    float64
 17  1975          264 non-null    float64
 18  1976          260 non-null    

Here, the `'Unnamed: 0'` column is 0-indexed instead of 1-indexed...this is why it's helpful to take a peek at the dataframe itself!
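
Columns named `Unnamed: 0` typically show up when a DataFrame was saved together with its row index and then read back without saying so. A hedged sketch of one way to handle this is the `index_col` argument of `read_csv`; the CSV text below is a made-up two-row sample, and `default_df` / `indexed_df` are illustrative names.

```python
import io
import pandas as pd

# Made-up CSV whose first column is a saved row index with no header name
csv_text = ",Country Name,2015\n0,Aruba,103889.0\n1,Andorra,70473.0\n"

# Default read: the nameless first column becomes 'Unnamed: 0'
default_df = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the row index instead
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_df.columns.tolist())
print(indexed_df.columns.tolist())
```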

Ok, looks like the world population data is something we could use along with the bands data. Let's see if we can make a column called `'country_population'` in the bands DataFrame that has the population of the country for that band. 

Let's first make sure that we actually have population data for all the countries in our bands DataFrame. We'll use the `isin()` method.

In [21]:
has_country = bands_df['country'].isin(world_pop_df['Country Name'])
country_not_found = bands_df['country'][~has_country]
country_not_found.value_counts()

International    170
Unknown           12
Name: country, dtype: int64

So we need to make sure to skip any rows that say "International" or "Unknown". 
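
One possible way to skip those rows is to reuse the boolean mask from `isin()` as a filter. The sketch below uses small made-up stand-ins for `bands_df` and `world_pop_df`; the `name` column and the single-letter band names are assumptions for illustration only.

```python
import pandas as pd

# Small made-up stand-ins for bands_df and world_pop_df
bands = pd.DataFrame({"name": ["A", "B", "C", "D"],
                      "country": ["Sweden", "International", "Unknown", "Norway"]})
world_pop = pd.DataFrame({"Country Name": ["Sweden", "Norway", "Finland"]})

# True where the band's country has a matching population row
has_country = bands["country"].isin(world_pop["Country Name"])

# Keep only the rows we can match, and renumber them from 0
bands_matched = bands[has_country].reset_index(drop=True)
print(bands_matched)
```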

There are a couple different ways to add a column. If we had the data for the column as a list, we could do it like this:

`bands_df["country_population"] = [283464, 1283389, ...]`

Or if we wanted to put this information at a particular spot in the DataFrame, we could use the `insert()` function:

`bands_df.insert(3, "country_population", [452342, 15425324, ...])`

However, our best option will be the `assign()` function, because this provides a place for us to specify how to fill up the column:

`bands_df = bands_df.assign(country_population = np.random.randint(10))`
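
The three approaches can be compared side by side on a toy table. This is a sketch with made-up values (the populations and years are placeholders, not real data), and `population_millions` is a hypothetical column added only to show `assign()` computing from an existing column.

```python
import pandas as pd

bands = pd.DataFrame({"name": ["A", "B"], "country": ["Sweden", "Norway"]})

# 1. Direct assignment appends the column at the end (modifies bands in place)
bands["country_population"] = [9800000, 5200000]

# 2. insert() places a column at a chosen position (also in place)
bands.insert(1, "formed", [1990, 1988])

# 3. assign() returns a new DataFrame and can compute from existing columns
bands_new = bands.assign(population_millions=lambda d: d["country_population"] / 1e6)
print(bands_new)
```

Note that `assign()` leaves the original DataFrame untouched, which is why its result is stored under a new name here.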

Actually, we're combining information from 2 different DataFrames, so we might be better off using `merge()`, `join()`, or `concat()`.

https://realpython.com/pandas-merge-join-and-concat/
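
As a rough sketch of the `merge()` route, the example below attaches one year of population data to a toy bands table. The tables and numbers are made up, and picking the `'2015'` column is an assumption for illustration; the real notebook may settle on a different year or approach.

```python
import pandas as pd

bands = pd.DataFrame({"name": ["A", "B", "C"],
                      "country": ["Sweden", "Norway", "International"]})
world_pop = pd.DataFrame({"Country Name": ["Sweden", "Norway"],
                          "2015": [9800000.0, 5200000.0]})

# Keep only the key column and one year, renamed to the target column name
pop_2015 = world_pop[["Country Name", "2015"]].rename(columns={"2015": "country_population"})

# A left merge keeps every band; countries with no match get NaN
merged = bands.merge(pop_2015, how="left", left_on="country", right_on="Country Name")
merged = merged.drop(columns="Country Name")
print(merged)
```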

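In [None]:
# Population column for one year ('2015' is picked here only as an example)
world_pop_df['2015']

In [None]:
# Sketch: look up each band's country in a country -> population Series
pop_by_country = world_pop_df.drop_duplicates('Country Name').set_index('Country Name')['2015']
bands_df.assign(country_population = bands_df['country'].map(pop_by_country))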

# DEBUGGING SECTION

Goal is to figure out the country names in world_pop_df and the country names in metal_bands_df

Or was it the country names in `bands_df`?

In [12]:
world_pop_df['Country Name']

0              Aruba
1            Andorra
2        Afghanistan
3             Angola
4            Albania
           ...      
264           Taiwan
265         Guernsey
266          Reunion
267    Åland Islands
268           Jersey
Name: Country Name, Length: 269, dtype: object

Figure out how to split up the comma-separated country lists into separate columns (if we are using `metal_bands_df`)

Use something like this:

`metal_bands_df[['origin_1', 'origin_2']] = metal_bands_df['origin'].str.split(', ', n=1, expand=True)`

In [13]:
metal_bands_df['origin'].value_counts().index

Index(['USA', 'Sweden', 'Germany', 'United Kingdom', 'Finland', 'Norway',
       'France', 'Italy', 'Canada', 'The Netherlands',
       ...
       'Tunisia, France', 'Greece, USA', 'Israel, Germany', 'Macedonia',
       'Portugal, United Kingdom', 'Australia, United Kingdom',
       'Sweden, Finland', 'Hungary, United Kingdom', 'Colombia, USA',
       'Greenland'],
      dtype='object', length=113)
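
Some origins, such as 'Tunisia, France', list two countries. A minimal sketch of splitting them into separate columns is shown below on a three-row toy table; `origin_1` and `origin_2` are illustrative column names, and limiting the split with `n=1` is a choice made here so that exactly two columns come back.

```python
import pandas as pd

origins = pd.DataFrame({"origin": ["USA", "Tunisia, France", "Sweden, Finland"]})

# n=1 splits at most once, so expand=True yields exactly two columns here;
# rows with a single country get a missing value in the second column
origins[["origin_1", "origin_2"]] = origins["origin"].str.split(", ", n=1, expand=True)
print(origins)
```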

In [14]:
print(bands_df['country'].value_counts()['Taiwan'])
print(bands_df['country'].value_counts()['Guernsey'])
print(bands_df['country'].value_counts()['Jersey'])
print(bands_df['country'].value_counts()['Åland Islands'])
print(bands_df['country'].value_counts()['Reunion'])
print(bands_df['country'].value_counts()['International'])
print(bands_df['country'].value_counts()['Unknown'])

30
4
1
2
3
170
12


Comparing the countries of bands_df to the countries of world_pop_df

In [15]:
'Aruba' in world_pop_df['Country Name'].unique()

True

In [16]:
'Arba' in world_pop_df['Country Name'].unique()

False

In [17]:
'Russia' in world_pop_df['Country Name'].unique()

True

In [18]:
for country_name in bands_df['country'].value_counts().index:
    if country_name not in world_pop_df['Country Name'].unique():
        print(f"{country_name} not found")

International not found
Unknown not found


In [19]:
has_country = bands_df['country'].isin(world_pop_df['Country Name'])
bands_df['country'][~has_country].value_counts()

International    170
Unknown           12
Name: country, dtype: int64

### Now we know which countries we need to hunt down! Changes to make:

#### For the world_pop_df:

Russian Federation -> Russia

Slovak Republic -> Slovakia

Venezuela, RB -> Venezuela

Iran, Islamic Rep. -> Iran

Brunei Darussalam -> Brunei

Syrian Arab Republic -> Syria

Egypt, Arab Rep. -> Egypt

Kyrgyz Republic -> Kyrgyzstan

Lao PDR -> Laos


#### For the bands_df:

International -> no value

Unknown -> no value

Korea| South -> Korea, Rep.

Macedonia (FYROM) -> Macedonia, FYR

Curaçao -> Curacao


#### No world pop data:

Taiwan

Guernsey

Reunion

Åland Islands

Jersey
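
Renames like the ones listed above could be applied with a dictionary and `replace()`. The sketch below uses a three-row stand-in for `world_pop_df` and only two of the listed renames; the dictionary name `renames` is an assumption.

```python
import pandas as pd

world_pop = pd.DataFrame({"Country Name": ["Russian Federation", "Slovak Republic", "Sweden"]})

# Subset of the renames listed above: old name -> new name
renames = {"Russian Federation": "Russia", "Slovak Republic": "Slovakia"}

# replace() swaps exact matches and leaves every other value untouched
world_pop["Country Name"] = world_pop["Country Name"].replace(renames)
print(world_pop["Country Name"].tolist())
```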

In [20]:
world_pop_df['Country Name'].unique()

array(['Aruba', 'Andorra', 'Afghanistan', 'Angola', 'Albania',
       'Arab World', 'United Arab Emirates', 'Argentina', 'Armenia',
       'American Samoa', 'Antigua and Barbuda', 'Australia', 'Austria',
       'Azerbaijan', 'Burundi', 'Belgium', 'Benin', 'Burkina Faso',
       'Bangladesh', 'Bulgaria', 'Bahrain', 'Bahamas, The',
       'Bosnia and Herzegovina', 'Belarus', 'Belize', 'Bermuda',
       'Bolivia', 'Brazil', 'Barbados', 'Brunei', 'Bhutan', 'Botswana',
       'Central African Republic', 'Canada',
       'Central Europe and the Baltics', 'Switzerland', 'Channel Islands',
       'Chile', 'China', "Cote d'Ivoire", 'Cameroon', 'Congo, Rep.',
       'Colombia', 'Comoros', 'Cabo Verde', 'Costa Rica',
       'Caribbean small states', 'Cuba', 'Curacao', 'Cayman Islands',
       'Cyprus', 'Czech Republic', 'Germany', 'Djibouti', 'Dominica',
       'Denmark', 'Dominican Republic', 'Algeria',
       'East Asia & Pacific (excluding high income)',
       'Early-demographic dividend', 'E

### Figuring out how to make these changes

Just change the names in the `'Country Name'` column of `world_pop_df` directly.

Add rows to it as well so that we have data for the 5 missing places.

Put in a section that discusses cleaning data by removing the 'Unknown' and 'International' rows.
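
Adding rows for the missing places might look something like the sketch below, which stacks a small table of new rows under the existing one with `pd.concat`. Everything here is a placeholder: the toy `world_pop` table is made up, and the new rows hold `NaN` because real population figures would still have to be looked up.

```python
import numpy as np
import pandas as pd

world_pop = pd.DataFrame({"Country Name": ["Sweden", "Norway"],
                          "2015": [9800000.0, 5200000.0]})

# Placeholder rows: the population values are left missing on purpose
missing = pd.DataFrame({"Country Name": ["Taiwan", "Guernsey"],
                        "2015": [np.nan, np.nan]})

# ignore_index=True renumbers the rows 0..n-1 after stacking
world_pop_extended = pd.concat([world_pop, missing], ignore_index=True)
print(world_pop_extended)
```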

# Column Operations


*   Removing empty cells
*   Accessing specific columns/cells
*   Applying functions (non-statistical) to a column
*   Creating a new column based on data from other columns
*   Re-ordering data? Visualizing data? 
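
As a placeholder for this section, here is a rough sketch of the first four operations on a toy table. The rows are made up, `years_since_formed` is a hypothetical column, and 2017 is used as the reference year only because the dataset is named `metal_bands_2017`.

```python
import pandas as pd

bands = pd.DataFrame({"band_name": ["A", "B", "C"],
                      "origin": ["Sweden", None, "USA"],
                      "formed": [1990, 1981, 1983],
                      "fans": [4147, 3712, 3105]})

# Removing rows that have an empty cell in a given column
bands_clean = bands.dropna(subset=["origin"])

# Accessing a whole column, and a single cell by row label and column name
fans = bands_clean["fans"]
first_name = bands_clean.loc[0, "band_name"]

# Applying a (non-statistical) function to every value in a column
upper_origin = bands_clean["origin"].apply(str.upper)

# Creating a new column based on data from another column
bands_clean = bands_clean.assign(years_since_formed=2017 - bands_clean["formed"])
print(bands_clean)
```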



# Summary Statistics

# Exporting Data