# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Regression Challenge
Week 3 | Day 5

## The Times university ranking dataset analysis

In this challenge, you will draw on the skills you have learned over the past three weeks to create a model of university prestige using the provided predictors. Specifically, your goal is to **predict the total score for each university for the year 2016**. This score directly maps into the university ranking.

You will be drawing on the following skills:
- Basic python and pandas skills
- Data cleaning
- EDA
- Regression
- Regularization
- Cross validation

## The Dataset

The data is in a csv file in your repo. It contains the following columns:

- **world_rank** - world rank for the university. Contains rank ranges and equal ranks (e.g. = 94 and 201-250).
- **university_name** - name of university.
- **country** - country of each university.
- **teaching** - university score for teaching (the learning environment).
- **international** - university score for international outlook (staff, students, research).
- **research** - university score for research (volume, income and reputation).
- **citations** - university score for citations (research influence).
- **income** - university score for industry income (knowledge transfer).
- **total_score** - total score for university, used to determine rank.
- **num_students** - number of students at the university.
- **student_staff_ratio** - Number of students divided by number of staff.
- **international_students** - Percentage of students who are international.
- **female_male_ratio** - Female student to Male student ratio.
- **year** - year of the ranking (2011 to 2016 included).

The target is the **total score**, which directly corresponds to the final ranking.

**N.B. - if the total score is reported as "-" that will be considered a 0 when scoring.**
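One possible way to apply this rule while cleaning is sketched below. The toy `scores` series is a hypothetical stand-in for the `total_score` column, and mapping "-" to 0 before conversion is only one reading of the rule; you may prefer to keep such rows out of training altogether.

```python
import pandas as pd

# Toy stand-in for the total_score column (hypothetical values): the real
# column mixes numeric strings, "-" placeholders, and missing values.
scores = pd.Series(["96.1", "-", "95.6", None], name="total_score")

# Map "-" to "0" (the scoring rule), then convert everything to float.
# Genuinely missing entries stay NaN so they can be handled separately.
scores_clean = pd.to_numeric(scores.replace("-", "0"), errors="coerce")

print(scores_clean.tolist())
```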

## Guidelines

- You will be provided with the data and targets from 2011 through 2015 and the data (no targets) for 2016.<br><br>

- Before 12:00pm, submit all final answers by filling in the predicted values for each university in the submission sheet (`.to_csv()` should be useful for this). <br><br>

- Your submission will be assessed on MSE -- so consider your loss functions! <br><br>
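The workflow these guidelines describe (train on 2011 through 2015, predict 2016, export the predictions) might look roughly like the sketch below. Everything in it is illustrative: the data is synthetic, the two feature columns were chosen arbitrarily, and the layout of the real submission sheet is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the cleaned data (hypothetical values, real column names).
rng = np.random.RandomState(0)
n = 60
df = pd.DataFrame({
    "university_name": ["univ_%d" % i for i in range(n)],
    "teaching": rng.uniform(20, 100, n),
    "research": rng.uniform(20, 100, n),
    "year": np.repeat([2011, 2012, 2013, 2014, 2015, 2016], n // 6),
})
df["total_score"] = 0.5 * df["teaching"] + 0.5 * df["research"]
df.loc[df["year"] == 2016, "total_score"] = np.nan  # 2016 targets are withheld

features = ["teaching", "research"]
train = df[df["year"] < 2016]
test = df[df["year"] == 2016]

model = LinearRegression().fit(train[features], train["total_score"])

# MSE can only be computed where targets exist, so it is reported on the
# training years here; a held-out year would give a more honest estimate.
train_mse = mean_squared_error(train["total_score"], model.predict(train[features]))

# Build the submission table; the column layout is an assumption.
submission = test[["university_name"]].copy()
submission["total_score"] = model.predict(test[features])
csv_text = submission.to_csv(index=False)  # pass a file path to write to disk instead
```

Ordinary least squares minimizes squared error directly, which is why it is a natural baseline when the assessment metric is MSE.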

## Expectations

The analysis is up to you. **This is fully open-ended.** You are expected to:

- Load the packages you need to do analysis
- Perform EDA on variables of interest
- Form a hypothesis or hypotheses on what is important for the score
- Check your data for problems, clean and munge data into correct formats
- Create or combine new columns/features where beneficial
- Perform statistical analysis with regression and describe the results
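For the regularization and cross-validation skills listed earlier, a minimal sketch is shown below. It uses synthetic data rather than the ranking dataset, and the candidate `alphas` grid is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem (hypothetical): 5 features, 2 of them irrelevant.
rng = np.random.RandomState(42)
X = rng.normal(size=(200, 5))
true_coef = np.array([3.0, 0.0, -2.0, 0.0, 1.0])
y = X.dot(true_coef) + rng.normal(scale=0.5, size=200)

# Scale features before the penalty is applied, then let RidgeCV pick alpha.
pipeline = make_pipeline(StandardScaler(), RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]))

# scikit-learn scorers follow "higher is better", so MSE comes back negated.
neg_mse = cross_val_score(pipeline, X, y, cv=5, scoring="neg_mean_squared_error")
cv_mse = -neg_mse.mean()
print(cv_mse)
```

Putting the scaler inside the pipeline means it is refit on each training fold, so no information from the held-out fold leaks into preprocessing.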

---

I will be here in class to help, but if you do not know how to do something, I expect you to **check documentation first**.

**You are not expected to know how to do things by heart. Knowing how to effectively look up the answers on the internet is a critical skill for data scientists!**

## Teams

Finally, you will be working as part of a team on this. To the randomizer...

In [81]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 

from sklearn import datasets
from sklearn import linear_model
from sklearn.metrics import r2_score
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict

In [82]:
dfs = pd.read_csv('/Users/kristensu/dropbox/GA-DSI/DSI-copy/curriculum/week-03/5.1-regression-challenge/datasets/ranking-submission.csv')
dfc = pd.read_csv('/Users/kristensu/dropbox/GA-DSI/DSI-copy/curriculum/week-03/5.1-regression-challenge/datasets/challenge-dataset.csv')

In [83]:
print(dfc.shape)
dfc.head(3)

(2603, 14)


Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year
0,1,Harvard University,United States of America,99.7,72.4,98.7,98.8,34.5,96.1,20152,8.9,25%,,2011
1,2,California Institute of Technology,United States of America,97.7,54.6,98.0,99.9,83.7,96.0,2243,6.9,27%,33 : 67,2011
2,3,Massachusetts Institute of Technology,United States of America,97.8,82.3,91.4,99.9,87.5,95.6,11074,9.0,33%,37 : 63,2011


In [84]:
#fraction of rows missing num_students: 59 of 2603, about 2.3%
float(59.0000/2603)*1.0000001

0.022666156703803304

In [85]:
dfc.isnull().sum()

world_rank                  0
university_name             0
country                     0
teaching                    0
international               0
research                    0
citations                   0
income                      0
total_score               800
num_students               59
student_staff_ratio        59
international_students     67
female_male_ratio         233
year                        0
dtype: int64

In [86]:
dfy = pd.DataFrame(dfc['total_score'])
dfy.head(3)

Unnamed: 0,total_score
0,96.1
1,96.0
2,95.6


In [87]:
dfc.dtypes

world_rank                 object
university_name            object
country                    object
teaching                  float64
international              object
research                  float64
citations                 float64
income                     object
total_score                object
num_students               object
student_staff_ratio       float64
international_students     object
female_male_ratio          object
year                        int64
dtype: object

In [88]:
dfc.head(2)

Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year
0,1,Harvard University,United States of America,99.7,72.4,98.7,98.8,34.5,96.1,20152,8.9,25%,,2011
1,2,California Institute of Technology,United States of America,97.7,54.6,98.0,99.9,83.7,96.0,2243,6.9,27%,33 : 67,2011


In [90]:
#world rank - object: int
#international - object: float
#income: object to float
#num_students - object: float
#female_male_ratio - object: float
#international_students - object: float
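As one example of the conversions planned above, the `international_students` values appear as percent strings such as `25%` in the `head()` output. A minimal sketch of turning them into fractions follows; the toy series is hypothetical, and whether you want a 0 to 1 fraction or a 0 to 100 percentage is a modelling choice.

```python
import pandas as pd

# Toy stand-in for the international_students column (hypothetical values).
pct = pd.Series(["25%", "27%", None, "33%"], name="international_students")

# Strip the trailing "%" and convert to a fraction between 0 and 1.
# Missing or unparseable entries become NaN.
pct_clean = pd.to_numeric(pct.str.rstrip("%"), errors="coerce") / 100.0

print(pct_clean.tolist())
```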

In [106]:
#world rank: object to int (had to change NaN to 100, note: 100 is median of non-NaNs)
#note: errors='coerce' also turns ties ('=94') and ranges ('201-250') into NaN
dfc['world_rank'] = pd.to_numeric(dfc['world_rank'], errors='coerce')
dfc['world_rank'] = dfc['world_rank'].fillna(100).astype(int)
#or dfc.loc[dfc['world_rank'].isnull(), 'world_rank'] = 100

In [74]:
#international: obj to float (entries that cannot be parsed become NaN)
dfc['international'] = pd.to_numeric(dfc['international'], errors='coerce')

In [111]:
#Set INCOME to Median of income without any '-'s
dfc.loc[dfc['income'] == '-', 'income'] = '41.0'
dfc['income'] = dfc['income'].astype(float)

In [78]:
#num_students: obj to float
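This cell was left without code. A possible conversion is sketched below, under the assumption that `num_students` is stored as text because some values contain thousands separators. The notebook output does not show the raw formatting, so the toy values here are hypothetical.

```python
import pandas as pd

# Hypothetical raw values: the comma separators are an assumption about why
# the column was read as object dtype.
raw = pd.Series(["20,152", "2,243", None, "11074"], name="num_students")

# Remove literal commas, then convert; anything unparseable becomes NaN.
num_students = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

print(num_students.tolist())
```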

In [23]:
dfc['female_male_ratio'].head()

0         NaN
1     33 : 67
2     37 : 63
3    42:58:00
4    45:55:00
Name: female_male_ratio, dtype: object

In [24]:
a = dfc['female_male_ratio']

In [25]:
a = a.str.replace('00', '')

In [26]:
a.head()

0        NaN
1    33 : 67
2    37 : 63
3     42:58:
4     45:55:
Name: female_male_ratio, dtype: object

In [29]:
a = a.str.strip()

In [30]:
a

0               NaN
1           33 : 67
2           37 : 63
3            42:58:
4            45:55:
5            46:54:
6            46:54:
7            50:50:
8           37 : 63
9            50:50:
10           52:48:
11           42:58:
12           50:50:
13           48:52:
14          31 : 69
15           48:52:
16              NaN
17              NaN
18           51:49:
19          39 : 61
20           53:47:
21           56:44:
22           53:47:
23           49:51:
24           48:52:
25              NaN
26          31 : 69
27      0.888888889
28           52:48:
29           54:46:
           ...     
2573         45:55:
2574         54:46:
2575         46:54:
2576        27 : 73
2577        34 : 66
2578        34 : 66
2579    0.888888889
2580         60:40:
2581        34 : 66
2582    0.438194444
2583         62:38:
2584         63:37:
2585         61:39:
2586         65:35:
2587         48:52:
2588         53:47:
2589         51:49:
2590         65:35:
2591        34 : 66
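
The output above shows three formats in `female_male_ratio`: `33 : 67`, time-like strings such as `42:58:00`, and bare decimals such as `0.888888889`. A rough sketch of a parser that handles the first two is given below. The helper name `female_share` is hypothetical, and treating the decimal entries as missing is an assumption, since their meaning is not clear from the data. It is written for Python 3, where `str` covers all text values.

```python
import re

import numpy as np
import pandas as pd


def female_share(value):
    """Return the female percentage from '33 : 67' or '42:58:00' style strings.

    Anything else (missing values, bare decimals such as '0.888888889')
    returns NaN, because those cannot be decoded with confidence.
    """
    if not isinstance(value, str):
        return np.nan
    # Take the first two colon-separated integers; a trailing ':00' is ignored.
    match = re.match(r"^\s*(\d+)\s*:\s*(\d+)", value)
    if match is None:
        return np.nan
    female, male = float(match.group(1)), float(match.group(2))
    total = female + male
    return 100.0 * female / total if total > 0 else np.nan


ratios = pd.Series([None, "33 : 67", "42:58:00", "0.888888889"])
female_pct = ratios.apply(female_share)
print(female_pct.tolist())
```

Parsing the numbers directly avoids the side effects of `str.replace('00', '')`, which removes every `00` in a string, not just the trailing one.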


In [17]:
X = dfc
y = dfy
lr = linear_model.LinearRegression()
lr_model = lr.fit(X,y)

ValueError: invalid literal for float(): 43:57:00
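
The error arises because `dfc` still contains string columns (here `female_male_ratio`), and the target has missing values for 2016. One way around it, sketched on a toy frame with hypothetical values, is to keep only rows with a known target and only numeric feature columns before fitting.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy frame mimicking the problem: one string column and some missing targets.
df = pd.DataFrame({
    "teaching": [90.0, 80.0, 70.0, 60.0, 50.0],
    "research": [85.0, 82.0, 65.0, 70.0, 40.0],
    "female_male_ratio": ["42:58:00", "33 : 67", None, "46:54:00", "50:50:00"],
    "total_score": [88.0, 81.0, 68.0, np.nan, np.nan],
})

# Keep rows where the target is known, then keep numeric predictors only.
labeled = df.dropna(subset=["total_score"])
X = labeled.select_dtypes(include=[np.number]).drop(columns=["total_score"])
y = labeled["total_score"]

model = LinearRegression().fit(X, y)
print(list(X.columns), model.coef_)
```

Numeric columns that still contain NaN would also need to be imputed or dropped before `fit` will accept them.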

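In [None]:
#NOTE: dd and dy are not defined above; they stand for the cleaned, numeric features and target
X = dd
y = dy
lr_dd = linear_model.LinearRegression()
lr_dd_model = lr_dd.fit(X, y)

#in-sample predictions from the model just fitted (not the earlier lr_model)
y_true = y
y_pred = lr_dd_model.predict(X)

lr_dd_r2 = r2_score(y_true=y_true, y_pred=y_pred)

lr_dd_model.coef_
len(lr_dd_model.coef_)
lr_dd_mav = abs(lr_dd_model.coef_).mean()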
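
An R² computed on the same rows the model was fitted on tends to be optimistic. A sketch of comparing it with out-of-fold predictions from `cross_val_predict` follows; the data is synthetic and stands in for the cleaned features and target, which are not shown in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for cleaned features and target (hypothetical values).
rng = np.random.RandomState(1)
X = rng.normal(size=(150, 4))
y = X.dot(np.array([2.0, -1.0, 0.5, 0.0])) + rng.normal(scale=1.0, size=150)

# In-sample: fit and predict on the same rows.
in_sample_pred = LinearRegression().fit(X, y).predict(X)

# Out-of-fold: each row is predicted by a model that never saw it.
out_of_fold_pred = cross_val_predict(LinearRegression(), X, y, cv=5)

in_sample_mse = mean_squared_error(y, in_sample_pred)
cv_mse = mean_squared_error(y, out_of_fold_pred)
cv_r2 = r2_score(y, out_of_fold_pred)
print(in_sample_mse, cv_mse, cv_r2)
```

Since the submission is assessed on MSE for unseen 2016 rows, the out-of-fold figure is the more relevant of the two.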