# Intro

# Importing Data
# start with importing messy data set but then use metal stuff for the rest of the teaching example

In [3]:
import requests

# URL locations of data
master_death_metal_bands = "https://raw.githubusercontent.com/orioncohen/metal-bands-by-nation/main/bands.csv"
master_metal_bands = "https://raw.githubusercontent.com/orioncohen/metal-bands-by-nation/main/metal_bands_2017.csv"
master_world_pop = "https://raw.githubusercontent.com/orioncohen/metal-bands-by-nation/main/world_population_1960_2015.csv"

# Grab the death metal bands data
req = requests.get(master_death_metal_bands)
death_metal_bands_data = req.text

# Grab the metal bands data
req = requests.get(master_metal_bands)
metal_bands_data = req.text

# Grab the world population data
req = requests.get(master_world_pop)
world_pop_data = req.text

In [4]:
import pandas as pd
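
Text downloaded with `requests` can be handed to `pandas` without first saving it to disk. A minimal sketch, assuming the response body is CSV text; `sample_text` is a made-up stand-in for `req.text` so that the example runs offline:

```python
import io

import pandas as pd

# Hypothetical stand-in for req.text (the real string would come from requests.get(url).text)
sample_text = "name,country\nBand A,France\nBand B,Norway\n"

# StringIO wraps the string in a file-like object that read_csv can consume
sample_df = pd.read_csv(io.StringIO(sample_text))
print(sample_df)
```

The same pattern should work for any of the three downloaded strings above, as long as they are plain comma-separated text.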

## A Realistic Depiction of Getting Data into Python

Exciting! We have some fresh new cyclic voltammetry data to analyze. Fortuitously, `pandas` has a function called `read_csv` designed for loading tabular data. Let's do it!

In [5]:
pd.read_csv("cyclic_voltammetry_output.txt")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb2 in position 1135: invalid start byte

Ahhh! Our loading failed terribly! Let's take a look at our file to see what might be amiss.

It looks like our data doesn't actually start until line 81, as indicated by "Nb header lines: 81" on the second line. It may have been wise to look at our file first, but eh, lesson learned.
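
Peeking at the top of a file does not require opening it in an editor. A minimal sketch of one way to do it; `peek` is a hypothetical helper, and the demo file is a made-up miniature, not the real instrument output:

```python
import itertools
import os
import tempfile


def peek(path, n_lines=5, encoding="utf-8"):
    """Return the first n_lines of a text file without reading the whole file."""
    # errors="replace" swaps undecodable bytes for a placeholder instead of raising
    with open(path, encoding=encoding, errors="replace") as handle:
        return [line.rstrip("\n") for line in itertools.islice(handle, n_lines)]


# Demo on a tiny made-up file that mimics an instrument header
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_output.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("Instrument ASCII FILE\nNb header lines : 3\ntime/s\tEwe/V\n0.0\t3.13\n")
    first_lines = peek(demo_path, n_lines=2)

print(first_lines)
```

Because undecodable bytes are replaced rather than raising an error, a helper like this can be used even before the file's encoding is known.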

Ok, now we need to turn to the [pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) for help. Like a good programmer, I'll google it to find the keyword arguments that we can use to modify `read_csv`.

... google "how to skip lines in pd read csv" ...

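Aha! The keyword `skiprows` appears to be what we are looking for. Its description states "Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file." Passing an integer skips that many lines from the top. The file reports 81 header lines, and the last of them holds the column names, so we skip the first 80 and let `pandas` read line 81 as the header: `skiprows=80`. We'll also pass `encoding='mac_roman'` to get past the `UnicodeDecodeError` above. Let's give it a try!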

In [10]:
pd.read_csv("cyclic_voltammetry_output.txt", skiprows=80, encoding='mac_roman')

Unnamed: 0,mode\tox/red\terror\tcontrol changes\tNs changes\tcounter inc.\tNs\ttime/s\tcontrol/V/mA\tEwe/V\tdq/mA.h\tEce/V\tP/W\t<I>/mA\tEwe-Ece/V\tx\t(Q-Qo)/mA.h\tCapacity/mA.h
0,3\t1\t0\t0\t0\t0\t0\t0.0002\t0\t3.13411832\t0\...
1,3\t1\t0\t0\t0\t0\t0\t60.0002\t0\t3.13436651\t0...
2,3\t1\t0\t0\t0\t0\t0\t120.0002\t0\t3.13472915\t...
3,3\t1\t0\t0\t0\t0\t0\t180.0002\t0\t3.13482451\t...
4,3\t1\t0\t0\t0\t0\t0\t240.0002\t0\t3.13497734\t...
...,...
23215,1\t0\t0\t0\t0\t0\t1\t417662.4821\t-0.01425\t2....
23216,1\t0\t0\t0\t0\t0\t1\t417699.5241\t-0.01425\t2....
23217,1\t0\t0\t0\t0\t0\t1\t417733.6881\t-0.01425\t2....
23218,1\t0\t0\t0\t0\t0\t1\t417733.6951\t-0.01425\t2....


The columns aren't separated and we have `\t` characters all over the place, but still, progress! The `\t` characters are the separators in our data file, meaning our file is `tab`-separated. Even though `csv` stands for Comma Separated Values, other separator characters are also common.

... google "how to specify separator in pd" ...

Looks like we can specify the type of separator by including the `sep` keyword. Let's do it!

In [11]:
pd.read_csv("cyclic_voltammetry_output.txt", skiprows=80, sep='\t', encoding='mac_roman')

Unnamed: 0,mode,ox/red,error,control changes,Ns changes,counter inc.,Ns,time/s,control/V/mA,Ewe/V,dq/mA.h,Ece/V,P/W,<I>/mA,Ewe-Ece/V,x,(Q-Qo)/mA.h,Capacity/mA.h
0,3,1,0,0,0,0,0,0.0002,0.00000,3.134118,0.000000e+00,-0.002699,0.000000,0.000000,3.136817,0.000000,0.000000,0.000000
1,3,1,0,0,0,0,0,60.0002,0.00000,3.134367,0.000000e+00,-0.002604,0.000000,0.000000,3.136970,0.000000,0.000000,0.000000
2,3,1,0,0,0,0,0,120.0002,0.00000,3.134729,0.000000e+00,-0.002527,0.000000,0.000000,3.137256,0.000000,0.000000,0.000000
3,3,1,0,0,0,0,0,180.0002,0.00000,3.134825,0.000000e+00,-0.002527,0.000000,0.000000,3.137352,0.000000,0.000000,0.000000
4,3,1,0,0,0,0,0,240.0002,0.00000,3.134977,0.000000e+00,-0.002355,0.000000,0.000000,3.137333,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23215,1,0,0,0,0,0,1,417662.4821,-0.01425,2.021435,-1.447666e-04,0.016924,0.000029,-0.014246,2.004511,1.917172,-0.328737,0.135839
23216,1,0,0,0,0,0,1,417699.5241,-0.01425,2.020919,-1.465844e-04,0.016771,0.000029,-0.014246,2.004148,1.918027,-0.328884,0.135986
23217,1,0,0,0,0,0,1,417733.6881,-0.01425,2.020404,-1.351955e-04,0.016962,0.000029,-0.014246,2.003442,1.918815,-0.329019,0.136121
23218,1,0,0,0,0,0,1,417733.6951,-0.01425,2.020938,-2.770032e-08,0.016790,0.000029,-0.014246,2.004148,1.918815,-0.329019,0.136121


Yay! We've successfully imported our DataFrame. Sometimes it just takes a little tinkering. We are going to move on to a nicer dataset for the rest of the workshop but hopefully this has given you a realistic view of how to troubleshoot your imports!
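
The three fixes (`encoding`, `skiprows`, and `sep`) can be tried out together on a tiny in-memory file. This is only a sketch: the file contents below are made up, and the degree sign stands in for whatever non-ASCII character tripped up the real file.

```python
import io

import pandas as pd

# Made-up miniature of an instrument file: two metadata lines, then a
# tab-separated header row and two data rows. The degree sign is non-ASCII.
raw_text = "Demo file\nTemperature: 25 °C\ntime/s\tEwe/V\n0.0\t3.13\n60.0\t3.14\n"
raw_bytes = raw_text.encode("mac_roman")

# Decoding these bytes as UTF-8 fails, much like the first read_csv attempt
try:
    raw_bytes.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Skip the two metadata lines, split on tabs, and decode as Mac Roman
cv_df = pd.read_csv(io.BytesIO(raw_bytes), skiprows=2, sep="\t", encoding="mac_roman")
print(utf8_failed)
print(cv_df)
```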

In the next cell, we will import our data directly from a file hosted on GitHub. This is no harder than loading a `.csv` file on our local computer. We'll use this sick death metal data going forward.

In [14]:
# Make a data frame
metal_bands_df = pd.read_csv("https://raw.githubusercontent.com/orioncohen/metal-bands-by-nation/main/metal_bands_2017.csv")
world_pop_df = pd.read_csv("https://raw.githubusercontent.com/orioncohen/metal-bands-by-nation/main/world_population_1960_2015.csv")
bands_df = pd.read_csv("https://raw.githubusercontent.com/orioncohen/metal-bands-by-nation/main/bands.csv")

# Creating Columns

Let's take a look at our data to see what we're working with! We can look at just the first few rows with the `head()` method.

In [15]:
bands_df.head()

Unnamed: 0,id,name,country,status,formed_in,genre,theme,active
0,1,('M') Inc.,United States,Unknown,2009.0,Death Metal,,2009-?
1,2,(sic),United States,Split-up,1993.0,Death Metal,,1993-1996
2,3,.F.O.A.D.,France,Active,2009.0,Death Metal,Life and Death,2009-present
3,4,100 Suns,United States,Active,2004.0,Death Metal,,2004-present
4,5,12 Days of Anarchy,United States,Split-up,1998.0,Death Metal,Anarchy,1998-2002


Looks like we know a bunch of information about each metal band! We have their `'name'`, their `'country'`, their `'genre'`...even the years that they were `'active'`! These are the columns of this data frame. Note that the `'id'` is different from the row number: `'id'` is a column in the data frame, so if we sorted the data differently, the `'id'` values would move along with their rows.
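
The difference between the `'id'` column and the row position can be seen by sorting a small data frame. A sketch using three rows loosely based on the output above; `toy_bands` is a made-up name for illustration:

```python
import pandas as pd

toy_bands = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["(sic)", ".F.O.A.D.", "100 Suns"],
        "formed_in": [1993, 2009, 2004],
    }
)

# Sorting moves whole rows, so each 'id' travels with its band
by_year = toy_bands.sort_values("formed_in")

# The index labels travel with the rows too; reset_index renumbers them from 0
renumbered = by_year.reset_index(drop=True)

print(by_year)
print(renumbered)
```

After `reset_index(drop=True)`, the row labels count up from 0 again while the `'id'` column keeps its sorted order.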

Now let's check out our other data!

In [16]:
world_pop_df.head()

Unnamed: 0.1,Unnamed: 0,Country Name,1960,1961,1962,1963,1964,1965,1966,1967,...,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
0,0,Aruba,54208.0,55435.0,56226.0,56697.0,57029.0,57360.0,57712.0,58049.0,...,100830.0,101218.0,101342.0,101416.0,101597.0,101936.0,102393.0,102921.0,103441.0,103889.0
1,1,Andorra,13414.0,14376.0,15376.0,16410.0,17470.0,18551.0,19646.0,20755.0,...,83373.0,84878.0,85616.0,85474.0,84419.0,82326.0,79316.0,75902.0,72786.0,70473.0
2,2,Afghanistan,8994793.0,9164945.0,9343772.0,9531555.0,9728645.0,9935358.0,10148841.0,10368600.0,...,25183615.0,25877544.0,26528741.0,27207291.0,27962207.0,28809167.0,29726803.0,30682500.0,31627506.0,32526562.0
3,3,Angola,5270844.0,5367287.0,5465905.0,5565808.0,5665701.0,5765025.0,5863568.0,5962831.0,...,18541467.0,19183907.0,19842251.0,20520103.0,21219954.0,21942296.0,22685632.0,23448202.0,24227524.0,25021974.0
4,4,Albania,1608800.0,1659800.0,1711319.0,1762621.0,1814135.0,1864791.0,1914573.0,1965598.0,...,2992547.0,2970017.0,2947314.0,2927519.0,2913021.0,2904780.0,2900247.0,2896652.0,2893654.0,2889167.0


Here, the `'Unnamed: 0'` column is 0-indexed instead of 1-indexed...this is why it's helpful to take a peek at the dataframe itself!
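
Columns named `'Unnamed: 0'` typically appear when a data frame's index was written to the file and then read back as an ordinary column. Whether that is what happened to these files is an assumption, but the effect is easy to reproduce with a made-up two-row table:

```python
import io

import pandas as pd

original = pd.DataFrame(
    {"Country Name": ["Aruba", "Andorra"], "2015": [103889.0, 70473.0]}
)

# By default, to_csv writes the index as a first column with an empty header
csv_text = original.to_csv()

# Reading it back naively turns that column into 'Unnamed: 0'
naive = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use the first column as the index instead
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(naive.columns.tolist())
print(clean.columns.tolist())
```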

Ok, looks like the world population data is something we could use along with the bands data. Let's see if we can make a column called `'country_population'` in the bands dataframe that has the population of the country for that band. 

There are a couple of different ways to add a column. If we had the data for the column as a list, we could do it like this:

`bands_df["country_population"] = [283464, 1283389, ...]`

Or if we wanted to put this information at a particular spot in the dataframe, we could use the `insert()` function:

`bands_df.insert(3, "country_population", [452342, 15425324, ...])`

However, our best option will be the `assign()` function, because this provides a place for us to specify how to fill up the column:

`bands_df = bands_df.assign(country_population = np.random.randint(10))`

Except we need to figure out how to fetch the actual population number, rather than filling in a random number, of course!
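
The three approaches can be compared side by side on a toy data frame. This is a sketch only: the band names and population numbers are placeholders, not real data.

```python
import pandas as pd

toy_bands = pd.DataFrame(
    {"name": ["Band A", "Band B"], "country": ["France", "Norway"]}
)

# 1. Direct assignment appends the column at the end, modifying toy_bands in place
toy_bands["country_population"] = [2000, 1000]  # placeholder numbers

# 2. insert() places the column at a given position, also in place
toy_bands.insert(1, "formed_in", [2009, 1993])

# 3. assign() returns a new DataFrame; a callable receives the DataFrame itself
with_flag = toy_bands.assign(is_large=lambda df: df["country_population"] > 1500)

print(toy_bands.columns.tolist())
print(with_flag)
```

Note that `assign()` leaves `toy_bands` untouched, which is why the earlier snippet stores its result back into a variable.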

So, how do we get a particular element from our `world_pop_df`?

In [22]:
world_pop_df['Country Name']

0                 Aruba
1               Andorra
2           Afghanistan
3                Angola
4               Albania
             ...       
259         Yemen, Rep.
260        South Africa
261    Congo, Dem. Rep.
262              Zambia
263            Zimbabwe
Name: Country Name, Length: 264, dtype: object
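
One possible way to go from a country name to a population is to turn the population table into a lookup keyed by country. A sketch under a few assumptions: the year columns are strings such as `"2015"`, a single fixed year is good enough, and the country names in both tables match exactly. The toy tables below reuse two values from the output above, and "Atlantis" is a deliberately unmatched name.

```python
import pandas as pd

toy_pop = pd.DataFrame(
    {"Country Name": ["Aruba", "Andorra"], "2015": [103889.0, 70473.0]}
)
toy_bands = pd.DataFrame(
    {"name": ["Band A", "Band B", "Band C"], "country": ["Andorra", "Aruba", "Atlantis"]}
)

# A boolean mask plus .loc selects one cell: the row for Aruba, column "2015"
aruba_2015 = toy_pop.loc[toy_pop["Country Name"] == "Aruba", "2015"].iloc[0]

# A Series indexed by country name acts as a lookup table
pop_by_country = toy_pop.set_index("Country Name")["2015"]

# map() looks up each band's country; names with no match become NaN
toy_bands = toy_bands.assign(
    country_population=toy_bands["country"].map(pop_by_country)
)

print(aruba_2015)
print(toy_bands)
```

Unmatched names showing up as `NaN` is useful here, since it flags countries that are spelled differently in the two datasets.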

# Column Operations


*   Removing empty cells
*   Accessing specific columns/cells
*   Applying functions (non-statistical) to a column
*   Creating a new column based on data from other columns
*   Re-ordering data? Visualizing data? 
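
As a rough preview of the first four items, here is a sketch on three rows taken from the `bands_df.head()` output above. The `still_active` column is a made-up example of a derived column, not something from the dataset.

```python
import pandas as pd

toy_bands = pd.DataFrame(
    {
        "name": ["(sic)", ".F.O.A.D.", "12 Days of Anarchy"],
        "formed_in": [1993.0, 2009.0, 1998.0],
        "theme": [None, "Life and Death", "Anarchy"],
        "active": ["1993-1996", "2009-present", "1998-2002"],
    }
)

# Removing empty cells: drop rows where 'theme' is missing
with_theme = toy_bands.dropna(subset=["theme"])

# Accessing a specific column and a specific cell
names = toy_bands["name"]
first_name = toy_bands.loc[0, "name"]

# Applying a (non-statistical) function to every value in a column
upper_names = toy_bands["name"].apply(str.upper)

# Creating a new column based on data from another column
toy_bands = toy_bands.assign(
    still_active=toy_bands["active"].str.endswith("present")
)

print(with_theme)
print(toy_bands)
```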



# Summary Statistics

# Exporting Data