# Project #2 - Data Analysis with Pandas
Lauren Campion
Olympic Games Dataset


In [2]:
import pandas as pd
# Relative path: assumes olympic_games.csv is in the notebook's working directory.
# A leading '/' would make pandas look in the filesystem root instead.
df = pd.read_csv('olympic_games.csv')
sorted_df = df.sort_values(by='country')
print(sorted_df)

**NumPy arrays** share many properties with Python lists, such as having a length and supporting in-place sorting:

In [None]:
import numpy as np
a = np.array([1, 3, 2])
len(a)
a.sort()
a

**arange**: By analogy with Python's built-in range() function, we can create array ranges in NumPy using np.arange():

In [None]:
r = np.arange(17)
r

Computations with numpy arrays are **much faster** than the corresponding operations with lists. Because NumPy itself is array-based, such computations **can also typically be expressed much more compactly**, without the need for loops or even comprehensions. In particular, NumPy arrays support vectorized operations, whereby we can (say) multiply every element in an array by a particular number all at once.

Later in this notebook we import the NumPy, Matplotlib, and pandas libraries into the Python interpreter,
then load the data set for the Nobel laureates and show the summary statistics for its numeric fields.

In [None]:
[3 * i for i in r]   # plain Python: a list comprehension over r
3 * r                # NumPy: the same computation, expressed more compactly

**timeit** - we can use the %%timeit cell magic (built on Python's timeit module) to compare execution speeds for plain
Python and NumPy structures. First, let's do some heavy computation in plain Python and time it:

In [None]:
%%timeit
[i**2 for i in range(1000)]

Now the same computation using NumPy. You should see a large difference in speed:

In [None]:
%%timeit
np.arange(1000)**2
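
The %%timeit magic only works inside IPython or Jupyter. As a rough sketch of the same comparison in a plain Python script, the standard-library timeit module can be called directly. The repetition count of 1000 is an arbitrary choice for illustration, and the measured times will differ from machine to machine.

```python
import timeit

import numpy as np

# Time 1,000 runs of each approach; absolute numbers depend on the machine.
list_time = timeit.timeit("[i**2 for i in range(1000)]", number=1000)
numpy_time = timeit.timeit(
    "np.arange(1000)**2", globals={"np": np}, number=1000
)

print(f"list comprehension: {list_time:.4f} s")
print(f"NumPy array:        {numpy_time:.4f} s")

# Both approaches produce the same values.
squares_list = [i**2 for i in range(1000)]
squares_numpy = np.arange(1000) ** 2
print(squares_list == squares_numpy.tolist())
```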

NumPy also provides support for **multi-dimensional arrays**:

In [None]:
a = np.array([[1, 2, 3], [4, 5, 6]])
a
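
To make the two-dimensional structure more concrete, here is a small sketch that inspects the same array. It only uses standard NumPy attributes and indexing, and the particular rows and columns picked are just for illustration.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

print(a.shape)        # (number of rows, number of columns)
print(a[1, 2])        # element in row 1, column 2 (zero-based)
print(a[:, 0])        # the first column
print(a.sum(axis=0))  # sum down each column
print(a.T.shape)      # transposing swaps the two dimensions
```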

# **Data Visualization with Matplotlib**
Matplotlib is a powerful visualization tool for Python that can do an absurdly large number of awesome things.
The exact mechanics of getting Matplotlib plots to display vary widely depending on the details of your setup. The most explicit way to show plots, which works on most systems from the REPL, is to call plt.show(). In a Jupyter notebook, the %matplotlib inline magic renders plots directly below the cell:


In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(-2, 2, 100)   # 100 evenly spaced points from -2 to 2
fig, ax = plt.subplots(1, 1)
ax.plot(x, x*x)               # plot the parabola y = x^2
plt.show()

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

nobel = pd.read_csv("laureates.csv")
nobel.describe()

In [None]:
nobel.head(10)

In [None]:
nobel.info()

Let’s use square brackets and a boolean criterion on the "surname" column to find the record for the Physics Nobel laureate Feynman:

In [None]:
nobel[nobel["surname"] == "Feynman"]

In [None]:
nobel[nobel["surname"] == "Feynman"].year

By using the correct index (i.e., 86), we can confirm that the value in that case is True:

In [None]:
(nobel["surname"] == "Feynman")[86]

Another way to get the year is to pass both the boolean criterion and the column name to loc. This returns just the index label (in this case, 86) and the column of interest.

In [None]:
nobel.loc[nobel["surname"] == "Feynman", "year"]
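
The selection steps above can be tried without the CSV file. The sketch below uses a small hand-built DataFrame, not laureates.csv; the index labels 6, 86, and 941 are arbitrary values chosen for illustration.

```python
import pandas as pd

# Hand-built sample for illustration; this is not laureates.csv.
sample = pd.DataFrame(
    {
        "surname": ["Curie", "Feynman", "Thorne"],
        "year": [1911, 1965, 2017],
    },
    index=[6, 86, 941],  # arbitrary index labels
)

mask = sample["surname"] == "Feynman"  # boolean Series, one entry per row
print(mask[86])                        # look up the mask by index label
print(sample[mask].year)               # filter the rows, then pick the column
print(sample.loc[mask, "year"])        # filter rows and pick the column in one step
```

Both forms return a Series holding the index label and the year; the loc version avoids building the intermediate filtered DataFrame.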

Finding a record by substring.

In [None]:
nobel.loc[nobel["firstname"].str.contains("Kip")]

Let's do a string search on a surname using str.contains()
>> nobel.loc[nobel["surname"].str.contains("Feynman")]  # fails because the surname column contains NaNs, which cannot be used in a boolean mask
>> The cell below shows the culprits

In [None]:
nobel.loc[nobel["surname"].isnull()]

Here's how to filter out the NaNs, by passing the option na=False to contains():

In [None]:
nobel.loc[nobel["surname"].str.contains("Feynman", na=False)]
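
The effect of na=False can be seen on a tiny hand-built example. The sketch below is not the laureates data: it assumes a surname column stored with object dtype (forced here explicitly) that has one missing value. The `failed` flag is a hypothetical helper name used only to record whether the unsafe lookup raised an error.

```python
import numpy as np
import pandas as pd

# Hand-built sample; the missing surname stands in for rows without one.
surnames = pd.Series(["Feynman", np.nan, "Curie"], dtype=object)
sample = pd.DataFrame({"surname": surnames})

# Without na=False, the missing value propagates into the mask.
mask = sample["surname"].str.contains("Feynman")
print(mask.tolist())

failed = False
try:
    sample.loc[mask]
except (ValueError, KeyError) as err:
    failed = True
    print("indexing failed:", err)

# With na=False, missing values count as "no match" and the mask is pure boolean.
safe_mask = sample["surname"].str.contains("Feynman", na=False)
print(sample.loc[safe_mask])
```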

Although there’s only one Nobel laureate named “Feynman”, there are famously several named “Curie”.
Let's find the Curies in the laureates.csv dataset.

In [None]:
curies = nobel.loc[nobel["surname"].str.contains("Curie", na=False)]
curies

With the result assigned to the variable curies we can get the first name and surname for each Curie laureate as follows:

In [None]:
curies[["firstname", "surname"]]

Marie Skłodowska-Curie is the only person to win a Nobel Prize for two different sciences. 
Let’s use pandas to see if there are any other multiple Nobel prize winners.
We use groupby() to group the winners by name and then use the size() method to see how many there are:

In [None]:
nobel.groupby(["firstname", "surname"]).size()

Let's add sort_values() to find any multiple laureates:

In [None]:
nobel.groupby(["firstname", "surname"]).size().sort_values()

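Even better is to include the id column in the groupby(), so that different people who happen to share a name are not counted as one laureate: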

In [None]:
laureates = nobel.groupby(["id", "firstname", "surname"])
sizes = laureates.size()
sizes[sizes > 1]
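
To see why grouping by id matters, consider a hand-built sample (all names and ids here are made up for illustration) in which two different people share a name and one person appears twice.

```python
import pandas as pd

# Made-up sample: ids 1 and 2 are different people with the same name,
# while id 3 is one person who appears twice.
sample = pd.DataFrame(
    {
        "id": [1, 2, 3, 3],
        "firstname": ["Alex", "Alex", "Sam", "Sam"],
        "surname": ["Doe", "Doe", "Roe", "Roe"],
    }
)

by_name = sample.groupby(["firstname", "surname"]).size()
by_id = sample.groupby(["id", "firstname", "surname"]).size()

print(by_name[by_name > 1])  # both names look like repeat winners
print(by_id[by_id > 1])      # only id 3 actually appears twice
```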

One of pandas’ greatest strengths is its ability to deal with times and time series, so let’s start by taking a look at selecting dates. One way we can do this is by searching for laureates by exact birthday as a string:

In [None]:
nobel.loc[nobel["born"] == "1879-03-14"]

Einstein's birthday, March 14 (3/14), is celebrated as Pi Day in America, after the first three digits of π ≈ 3.14. Another popular math constant is tau, τ = 2π ≈ 6.28, which corresponds to the calendar date 6/28. Let's see if any laureates have this birthday:

In [None]:
nobel.loc[nobel["born"].str.contains("06-28", na=False)]

We can narrow down the results to Nobel laureates in Physics by using the & operator to perform an element-wise logical AND, as follows:

In [None]:
nobel.loc[(nobel["born"].astype('string').str.contains("06-28")) & (nobel["category"] == "physics")]
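
A minimal sketch of combining two conditions, using a made-up DataFrame rather than the laureates data. Naming the masks first (`tau_day` and `physics` are names chosen for this example) keeps the final selection readable.

```python
import pandas as pd

# Made-up sample for illustration; names and dates are not real laureates.
sample = pd.DataFrame(
    {
        "surname": ["A", "B", "C", "D"],
        "born": ["1900-06-28", "1910-06-28", "1920-01-01", None],
        "category": ["physics", "chemistry", "physics", "physics"],
    }
)

tau_day = sample["born"].str.contains("06-28", na=False)
physics = sample["category"] == "physics"

# When the conditions are written inline, each needs its own parentheses,
# because & binds more tightly than ==.
print(sample.loc[tau_day & physics])
```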

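Let’s take a look at the first of these records using iloc (“integer location”), which selects a row by its position. Here the position is 79, which matches the index label because the DataFrame uses the default integer index: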

In [None]:
nobel.iloc[79]
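
Position and index label only coincide when the DataFrame has the default integer index. The sketch below, on a made-up DataFrame with non-default labels, shows how iloc (by position) and loc (by label) can differ.

```python
import pandas as pd

# Made-up sample with index labels that differ from row positions.
sample = pd.DataFrame({"surname": ["A", "B", "C"]}, index=[10, 20, 30])

print(sample.iloc[1])  # second row, selected by position
print(sample.loc[20])  # the same row, selected by index label

# After filtering, positions restart at 0 but the labels are kept.
filtered = sample[sample["surname"] != "A"]
print(filtered.iloc[0])
```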

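Speaking of birthdates, the lifespans of Nobel laureates have been the subject of some scientific research over the years. An interesting exercise is to compute the lifespan of each Nobel laureate from the dates of birth and death.
We will do this by creating a new lifespan column in the DataFrame: parse both columns as dates, subtract them to get a time delta, and convert the result from days to years. (Dividing by np.timedelta64(1, "Y") may be rejected by newer pandas versions, because a year is not a fixed-length unit.)

In [None]:
nobel["lifespan"] = (pd.to_datetime(nobel["died"], errors="coerce") - pd.to_datetime(nobel["born"], errors="coerce")).dt.days / 365.25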
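
As a rough, self-contained sketch of the lifespan calculation, the example below uses a two-row hand-built DataFrame instead of laureates.csv. It assumes that a missing death date may be encoded as an unparseable string such as "0000-00-00", and it approximates a year as 365.25 days.

```python
import pandas as pd

# Hand-built sample; the second row stands in for a living laureate.
sample = pd.DataFrame(
    {
        "born": ["1879-03-14", "1940-06-01"],
        "died": ["1955-04-18", "0000-00-00"],
    }
)

# errors="coerce" turns unparseable dates into NaT instead of raising.
born = pd.to_datetime(sample["born"], errors="coerce")
died = pd.to_datetime(sample["died"], errors="coerce")

# Time delta -> whole days -> approximate years (365.25 days per year).
sample["lifespan"] = (died - born).dt.days / 365.25
print(sample)
```

Rows without a valid death date end up with a missing lifespan, so they are skipped automatically by summary methods such as mean().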