# Class project

- every student completes their own "small" data science project, including a 5-minute presentation
- schedule:
    - by January 14th submit the topic of your project and source of data (group discussion during January 11th class)
    - exploratory analysis progress report by January 21st
    - data munging progress report by January 28th
    - presentations and Q&A during last class on February 8th
    - final project submission deadline February 18th
    
- outline of progress
    - find a topic and source of data - _[see this video to get ideas on data sources](https://youtu.be/KbM596pSQ_A)_
    - proceed with exploratory analysis
    - proceed with data munging
    - visualize and interpret or build a predictive model
    - create a presentation of your results (must present to pass the class)
    - submit your final project in the Jupyter notebook format (including commentary and data)



# Data Science Tutorial

- an example of how to use Python for Data Science

- following _[A Complete Python Tutorial to Learn Data Science from Scratch](https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-science-python-scratch-2/)_

- today's plan


#### Table of contents

1. Basics of Python for Data Analysis
        Why learn Python for data analysis?
        Python 2.7 vs. 3.4
        How to install Python?
        Running a few simple programs in Python
   
   
2. Python libraries and data structures
        Python Data Structures (lists, strings, tuples, dictionaries)
        Python Iteration and Conditional Constructs
        Python Libraries


3. Exploratory analysis in Python using Pandas
        Introduction to series and dataframes
        Analytics Vidhya dataset - Loan Prediction Problem


4. **Data Munging in Python using Pandas**


5. Building a Predictive Model in Python
        Logistic Regression
        Decision Tree
        Random Forest


## Data munging/wrangling using Pandas

     Data Wrangling, sometimes referred to as Data Munging, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. The goal of data wrangling is to assure quality and useful data. Data analysts typically spend the majority of their time in the process of data wrangling compared to the actual analysis of the data.
     
     - source: Wikipedia
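
As a small illustration of the "raw" to usable transformation described above, here is a minimal sketch that turns currency and percent strings into numbers. The frame `raw` and its values are hypothetical and only stand in for a scraped CSV. Passing `regex=False` makes `'$'` a literal character; in pandas versions where `str.replace` treats the pattern as a regular expression by default, a bare `'$'` is the end-of-string anchor and removes nothing.

```python
import pandas as pd

# Hypothetical raw values, as they might arrive from a scraped CSV.
raw = pd.DataFrame({
    "price": ["$27.60", "$25.10", "$1,250.00"],
    "odds": ["0.012%", "0.5%", "12%"],
})

clean = pd.DataFrame({
    # regex=False treats "$" as a literal character, not a regex anchor.
    "price": (raw["price"]
              .str.replace("$", "", regex=False)
              .str.replace(",", "", regex=False)
              .astype(float)),
    # Percent values stay in percent units; only the symbol is removed.
    "odds": raw["odds"].str.replace("%", "", regex=False).astype(float),
})

print(clean.dtypes)
```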

### Recap

- we found a few problems in the data set 
    - missing values in some variables
    - ApplicantIncome and LoanAmount seem to contain extreme values at either end
    - non-numerical fields (Gender, Property_Area, Married, Education and Dependents) still need to be checked for useful information
- the problems above need to be solved before the data is ready for a good model
- for more details check _[A Comprehensive Guide to Data Exploration](https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/)_
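
The checks for extreme values and non-numerical fields listed in the recap could be sketched as follows. The frame `loans` is a hypothetical stand-in that reuses the column names from the recap; its values are invented for illustration and are not taken from the actual dataset.

```python
import pandas as pd

# Hypothetical stand-in for the loan data; values are invented.
loans = pd.DataFrame({
    "ApplicantIncome": [2500, 3000, 4200, 81000, 3600],
    "LoanAmount": [110.0, 120.0, 128.0, 700.0, 95.0],
    "Gender": ["Male", "Female", None, "Male", "Male"],
})

# Extreme values: a maximum far above the median hints at outliers.
summary = loans[["ApplicantIncome", "LoanAmount"]].describe()

# Non-numerical fields: frequency of each category, missing values included.
gender_counts = loans["Gender"].value_counts(dropna=False)

print(summary)
print(gender_counts)
```

A large gap between the `50%` and `max` rows of `summary` is a quick signal that a column may need a transformation (for example a log) or closer inspection.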

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('skinclub_database.csv') 

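# Convert blank wear values into 'nw' (no wear)
df.loc[df['wear'] == ' ', 'wear'] = 'nw'

# Strip currency and percent symbols, then convert to float.
# regex=False makes '$' a literal character rather than the end-of-string anchor.
df['price'] = df['price'].str.replace('$', '', regex=False).astype(float)
df['cost'] = df['cost'].str.replace('$', '', regex=False).astype(float)
df['odds'] = df['odds'].str.replace('%', '', regex=False).astype(float)

# Absolute return per outcome, and price as a percentage of cost
df['return'] = df['price'] - df['cost']
df['return_p'] = (df['price'] / df['cost'])*100

df.to_csv("skinclub_database_processed.csv")
print(df)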

                case    cost    type              skin  rarity wear  ST  \
0          Mil-Spec     0.35    M4A4  Radiation Hazard       3  ft    0   
1          Mil-Spec     0.35    M4A4  Radiation Hazard       3  ww    0   
2          Mil-Spec     0.35  M4A1-S          Briefing       3  mw    1   
3          Mil-Spec     0.35  M4A1-S          Briefing       3  bs    1   
4          Mil-Spec     0.35    M4A4       Faded Zebra       3  fn    0   
...              ...     ...     ...               ...     ...  ...  ..   
4633  Echo Hope Case  100.00   USP-S             Orion       5  mw    0   
4634  Echo Hope Case  100.00   AK-47           Asiimov       6  mw    0   
4635  Echo Hope Case  100.00  M4A1-S   Chantico's Fire       6  mw    0   
4636  Echo Hope Case  100.00    M4A4         Buzz Kill       6  fn    0   
4637  Echo Hope Case  100.00   AK-47   Wasteland Rebel       6  ft    0   

      price  min_val  max_val    odds  \
0     27.60    99989   100000   0.012   
1     25.10    99

### Check missing values in the dataset
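
A minimal sketch of a missing-value check, assuming a small hypothetical frame `sample` whose column names mirror the data loaded earlier (the values are invented). It counts true NaN entries per column and, separately, blank strings, since a blank such as `' '` is not reported by `isnull()`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; column names follow the earlier cell, values are invented.
sample = pd.DataFrame({
    "skin": ["Orion", "Asiimov", "Briefing", "Buzz Kill"],
    "wear": ["mw", " ", "ft", np.nan],
    "price": [27.6, np.nan, 3.1, 120.0],
})

# Count true NaN values per column.
nan_counts = sample.isnull().sum()

# Blank strings are not NaN, so count them separately for the text columns.
text_cols = ["skin", "wear"]
blank_counts = sample[text_cols].apply(
    lambda col: col.str.strip().eq("").sum()
)

print(nan_counts)
print(blank_counts)
```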

In [6]:
cases = pd.unique(df['case'])
for case in cases:
    
    # Isolate Each Case
    data = df.loc[df['case'] == case]

    # Determine No of Outcomes
    Outcomes, col = data.shape

    print(case)
    # Reset indexing for counter
    data = data.reset_index(drop=True)

    # Determine Cost (identical for every row of a case, so read the first row)
    cost = data.at[0,'cost']
    print(cost)
    # Calculate ROI: odds-weighted sum of returns relative to cost
    # ('odds' is stored in percent, see the '%' stripped on load)
    Net_Profit = (data['odds'] * data['return']).sum()
    ROI = (Net_Profit / cost) * 100

    # Gain Calculations
    gain_set = data[data['price'] >= data['cost']]

        # Determine No of Outcomes
    Pos_results, col = gain_set.shape

        # Reset indexing for counter
    gain_set = gain_set.reset_index(drop=True)

        # Calculate Avg Positive Return (odds-weighted sum over profitable outcomes)
    Avg_gain_return = (gain_set['odds'] * gain_set['return']).sum()
    Avg_gain_return_p = Avg_gain_return/cost

        # Calculate Max Gain
    stats = gain_set['return']
    Max_Return = stats.max()
    stats = gain_set['return_p']
    Max_Return_p = stats.max()
    
    # Breakeven Calculations
    stats = gain_set['min_val']
    Profit_Stat = stats.min()
    Profit_Chance = (100000-Profit_Stat)/1000

    # Loss Calculations
    loss_set = data[data['price'] < data['cost']]

        # Determine No of Outcomes
    Neg_results, col = loss_set.shape

        # Reset indexing for counter
    loss_set = loss_set.reset_index(drop=True)

        # Calculate Avg Negative Return (odds-weighted sum over losing outcomes)
    Avg_loss_return = (loss_set['odds'] * loss_set['return']).sum()
    Avg_loss_return_p = Avg_loss_return/cost

        # Calculate Min Gain
    stats = loss_set['return']
    Min_Return = stats.min()
    stats = loss_set['return_p']
    Min_Return_p = stats.min()

    # Roll Calculations
        # ~50 Rolls
    data['roll50'] = round((data['odds']/100)* 50)
    no_roll = data['roll50'].sum()
    data['prof'] = data['roll50'] * data['price']
    Roll_50 = (data['prof'].sum()) - (no_roll * cost)
    Roll_50_p = Roll_50/(no_roll * cost)
    print(Roll_50, Roll_50_p)
        # ~100 Rolls
    data['roll100'] = round((data['odds']/100)* 100)
    no_roll = data['roll100'].sum()
    data['prof'] = data['roll100'] * data['price']
    Roll_100 = (data['prof'].sum()) - (no_roll * cost)
    Roll_100_p = Roll_100/(no_roll * cost)
    print(Roll_100, Roll_100_p)
        # ~1000 Rolls
    data['roll1000'] = round((data['odds']/100)* 1000)
    no_roll = data['roll1000'].sum()
    data['prof'] = data['roll1000'] * data['price']
    Roll_1000 = (data['prof'].sum()) - (no_roll * cost)
    Roll_1000_p = Roll_1000/(no_roll * cost)
    print(Roll_1000, Roll_1000_p)
        # ~100000 Rolls
    data['roll100000'] = round((data['odds']/100)* 100000)
    no_roll = data['roll100000'].sum()
    data['prof'] = data['roll100000'] * data['price']
    Roll_100000 = (data['prof'].sum()) - (no_roll * cost)
    Roll_100000_p = Roll_100000/(no_roll * cost)
    print(Roll_100000, Roll_100000_p)

    ## Write_TO_CSV Code


Mil-Spec 
0.35
-10.809999999999999 -0.6863492063492064
-19.349999999999998 -0.5881458966565349
-88.93 -0.2548502650809572
-3490.510000000002 -0.0997288571428572
Fire and Water Case
8.0
-187.37 -0.48794270833333336
-385.3300000000001 -0.4965592783505156
-1061.1100000000006 -0.13250624375624384
-80794.72999999986 -0.10099341249999984
Rainbow  Case
3.5
-101.72 -0.6183586626139818
-161.19 -0.47478645066273933
-425.7800000000002 -0.12189521900944753
-34707.139999999956 -0.09916325714285702
Hot  Case
5.5
-106.41000000000003 -0.38694545454545465
-194.11 -0.36012987012987013
-433.09000000000015 -0.07866497139224414
-54680.54000000004 -0.0994191636363637
Restricted 
2.0
-51.31 -0.641375
-94.84 -0.5098924731182796
-288.1700000000001 -0.1437974051896208
-19920.599999999977 -0.09960299999999989
Classified  Case
3.0
-57.33 -0.4065957446808511
-103.78999999999999 -0.36038194444444444
-472.27999999999975 -0.15805890227576966
-29758.129999999946 -0.09919376666666649
Covert  Case
6.99
-134.810000000000