# Data Science Mini-Project

Given the information from the previous sections, you should be equipped to read data into Pandas dataframes. Dataframes allow us to analyze and interpret large amounts of data. For this mini-project, we'll be looking at data from a website called kaggle.com. 

Kaggle is an awesome resource for all things data science: it hosts public datasets, competitions, and tutorials. Its competitions tend to deal with real-world problems. For example, last year it ran one on ways to deal with Los Angeles' hiring crisis as the city's workforce retires (https://www.kaggle.com/c/data-science-for-good-city-of-los-angeles), and one on using news to predict stock prices (https://www.kaggle.com/c/two-sigma-financial-news).

**For today's mini-project, we are going to look at a dataset from Kaggle and use Pandas to structure, visualize, and interpret it.**

Because we want to recognize that not everyone will be interested in the same topics, we want to leave this project open-ended. In this example, we'll take a look at a CSV file pertaining to the success of different climbing routes on Mount Rainier, but, if this sounds boring or unappealing, we encourage you to search around and find a different dataset. 

## Our Goal

Our goal is to look at the data from previous Mt Rainier climbs, and use it to **determine which route is the safest**. While this may sound trivial, we should note that the data is separated by year. So, one route may have had a 100% success rate one year, despite not being the safest. 

## Getting Started

The data files should be available in the GitHub repo, and we'll do the work of importing them for you below. For the base project, we'll only be looking at the climbing statistics; the weather data is for the bonus. Our first step will be to load the data from the CSV into a dataframe.

### Setup

***Please make sure to run the following blocks to install requirements and import your data!***

In [4]:
# Installing requirements
!pip install pandas
!pip install numpy
!pip install requests

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable


In [5]:
import pandas as pd
import numpy as np
import requests

# Fetch data from GitHub
url = "https://raw.githubusercontent.com/HackBinghamton/DataScienceWorkshop/master/DataScience/climbing_statistics.csv"
raw_data = requests.get(url).text

# Write it to a local file
with open("climbing_statistics.csv", "w") as data_file:
    data_file.write(raw_data)

# Read it into a dataframe
dataframe = pd.read_csv('climbing_statistics.csv')
dataframe

Unnamed: 0,Date,Route,Attempted,Succeeded,Success Percentage
0,11/27/2015,Disappointment Cleaver,2,0,0.000000
1,11/21/2015,Disappointment Cleaver,3,0,0.000000
2,10/15/2015,Disappointment Cleaver,2,0,0.000000
3,10/13/2015,Little Tahoma,8,0,0.000000
4,10/9/2015,Disappointment Cleaver,2,0,0.000000
...,...,...,...,...,...
4072,1/16/2014,Little Tahoma,2,0,0.000000
4073,1/6/2014,Disappointment Cleaver,8,0,0.000000
4074,1/6/2014,Disappointment Cleaver,8,0,0.000000
4075,1/5/2014,Disappointment Cleaver,2,0,0.000000


### Challenge Questions

Here's some information that we can ascertain fairly easily with numpy/pandas. Can you figure out how to determine: 

1. How many possible routes are there?
2. How much time has elapsed between the first and last recorded climb attempts?
3. During what year were there the most climb attempts?

Here's a code box for you to work in:

In [None]:
# Your code here!

*Hint:* Look into .filter(), .groupby(), and regex

*Hint 2:* There's a class called DatetimeIndex in pandas that could be useful


### Partial Solution

Hopefully you were able to answer the above questions. If not, here's a code snippet to answer question 3.

In [None]:
# Total climb attempts per year
dataframe['Year'] = pd.DatetimeIndex(dataframe['Date']).year
groups = dataframe.groupby(['Year'])['Attempted'].agg('sum')
print(groups)

# Year with the most climb attempts
print(groups.idxmax())
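
For questions 1 and 2, here is a minimal sketch of one possible approach, not necessarily the intended one. It runs on a tiny made-up dataframe (`toy`, with hypothetical route names) rather than the real data, and it assumes the dates are month/day/year strings like the ones shown in the output above.

```python
import pandas as pd

# Toy stand-in for the climbing data; these rows are made up for illustration.
toy = pd.DataFrame({
    "Date": ["1/5/2014", "6/20/2014", "7/4/2015", "11/27/2015"],
    "Route": ["Route A", "Route B", "Route A", "Route C"],
    "Attempted": [2, 10, 6, 3],
})

# Question 1: count the distinct values in the Route column.
route_count = toy["Route"].nunique()

# Question 2: parse the date strings, then subtract the earliest from the latest.
dates = pd.to_datetime(toy["Date"], format="%m/%d/%Y")
elapsed = dates.max() - dates.min()

print(route_count)
print(elapsed)
```

The same two ideas should carry over to `dataframe` by swapping it in for `toy`, assuming its `Date` column parses cleanly.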

## Safest Route Project
So, we want to aggregate the number of successes / attempts for each route to determine which is the safest. We'll need to use many of the functions we used above in the Challenge section. Because we walked through the previous problems together, we won't attach the code to solve this problem. However, HackBU organizers will be happy to help out if you're struggling.

Here are some steps that should point you in the right direction:

1. Split the data into N separate routes.
2. Sum the number of attempts for each route.
3. Sum the number of successes for each route.
4. Divide the number of successes by the number of attempts for each.
5. Find the highest value from step 4.

*Hint:* Steps 2 and 3 can be consolidated into one. Can any other steps be consolidated too?

In [None]:
# Your code here!







## Result

If you've successfully found the safest route, you should have determined that the safest route is one of the following:
<br><br><br>
A. Ptarmigan RIngraham Directge <br>
B. Gibralter Ledges<br>
C. Wilson Headwall<br>
D. Tahoma Cleaver<br>

## Correct Answer, Bonus, and Next Steps

<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>

The correct answer is Tahoma Cleaver, with a success rate of 100%. It's important to note that we didn't control
for the number of attempts, so this could have impacted our results. Is there any way that we could control for attempts, 
say, to only look at routes that have been attempted at least ~50 times?
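
One way to explore that question is to filter on total attempts before comparing success rates. The sketch below uses a made-up dataframe with hypothetical route names, and `MIN_ATTEMPTS` is an assumed cutoff based on the "~50" mentioned above, so treat it as an illustration rather than the definitive answer.

```python
import pandas as pd

# Made-up rows for illustration; route names and counts are hypothetical.
toy = pd.DataFrame({
    "Route": ["Route A", "Route A", "Route B", "Route C", "Route C"],
    "Attempted": [40, 60, 2, 30, 45],
    "Succeeded": [20, 30, 2, 24, 36],
})

MIN_ATTEMPTS = 50  # assumed threshold, borrowed from the "~50" suggestion

# Sum attempts and successes per route in one pass.
totals = toy.groupby("Route")[["Attempted", "Succeeded"]].sum()

# Overall success rate per route: total successes / total attempts.
totals["Success Rate"] = totals["Succeeded"] / totals["Attempted"]

# Keep only the routes with at least MIN_ATTEMPTS total attempts.
well_sampled = totals[totals["Attempted"] >= MIN_ATTEMPTS]

best_route = well_sampled["Success Rate"].idxmax()
print(well_sampled)
print(best_route)
```

In this toy data, the rarely attempted route has a perfect record but drops out once the threshold is applied, which is the kind of effect the question above is getting at.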

### Bonus

Controlling for any variables that may have impacted our findings is extremely important. That's why it's worth determining whether
the weather conditions at a certain time may have affected our results. Using the weather data CSV file, how can
we adjust or confirm our results such that they haven't been affected by varying weather conditions?
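
As a starting point, one option is to join the weather onto the climbs by date and compare routes within similar conditions. This is only a sketch on made-up data: the `Wind Speed` column name, the `WINDY_CUTOFF` value, and the route names are all assumptions, so check the real weather CSV's header before adapting it.

```python
import pandas as pd

# Made-up climbing rows; route names and counts are hypothetical.
climbs = pd.DataFrame({
    "Date": ["6/1/2015", "6/1/2015", "6/2/2015", "6/2/2015"],
    "Route": ["Route A", "Route B", "Route A", "Route B"],
    "Attempted": [10, 10, 10, 10],
    "Succeeded": [8, 6, 3, 2],
})

# Made-up weather rows; the "Wind Speed" column name is an assumption.
weather = pd.DataFrame({
    "Date": ["6/1/2015", "6/2/2015"],
    "Wind Speed": [5.0, 30.0],
})

# Attach each day's weather to that day's climbs.
merged = climbs.merge(weather, on="Date", how="inner")

# Label each row as calm or windy using a hypothetical cutoff.
WINDY_CUTOFF = 15.0
merged["Conditions"] = (merged["Wind Speed"] > WINDY_CUTOFF).map(
    {True: "windy", False: "calm"}
)

# Success rate per route within each weather condition.
totals = merged.groupby(["Conditions", "Route"])[["Attempted", "Succeeded"]].sum()
totals["Success Rate"] = totals["Succeeded"] / totals["Attempted"]
print(totals)
```

If a route stays on top within each weather group, that is some evidence the ranking isn't just an artifact of when people chose to climb it.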