# Times University ranking dataset analysis

In this codealong we are going to analyze a ranking of universities using regression. Specifically, we are going to **predict the university ranking** with the provided predictors.

---

The CSV contains the following columns:

- **world_rank** - world rank for the university. Contains tied ranks and rank ranges (e.g. =94 and 201-250).
- **university_name** - name of university.
- **country** - country of each university.
- **teaching** - university score for teaching (the learning environment).
- **international** - university score for international outlook (staff, students, research).
- **research** - university score for research (volume, income and reputation).
- **citations** - university score for citations (research influence).
- **income** - university score for industry income (knowledge transfer).
- **total_score** - total score for university, used to determine rank.
- **num_students** - number of students at the university.
- **student_staff_ratio** - number of students divided by number of staff.
- **international_students** - percentage of students who are international.
- **female_male_ratio** - ratio of female to male students.
- **year** - year of the ranking (2011 to 2016 inclusive).

We are going to predict the **total score**, which directly corresponds to the ranking.
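
Because `world_rank` mixes plain ranks, tied ranks and ranges, it cannot be used as a number directly. One possible way to convert it is sketched below. `parse_rank` is a hypothetical helper, and taking the midpoint of a range is an assumption made for illustration, not something the dataset prescribes.

```python
import numpy as np

def parse_rank(raw):
    """Turn a world_rank string into a float (hypothetical helper)."""
    text = str(raw).strip().lstrip('=')      # '=94' -> '94'
    try:
        if '-' in text:                      # '201-250' -> midpoint of the range
            low, high = text.split('-', 1)
            return (float(low) + float(high)) / 2
        return float(text)
    except ValueError:
        return np.nan                        # anything that still cannot be parsed

ranks = [parse_rank(r) for r in ['1', '=94', '201-250']]
print(ranks)
```

With a DataFrame loaded, the same function could be applied with `df['world_rank'].map(parse_rank)`.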

---

### ONLY THE DATA PATH IS PROVIDED!

The analysis is up to you. This is an open-ended practice exercise. You are expected to:

- Load the packages you need to do analysis
- Perform EDA on variables of interest
- Form a hypothesis or hypotheses on what is important for the score
- Check your data for problems, clean and munge data into correct formats
- Create new variables from columns if necessary
- Perform statistical analysis with regression and describe the results
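
As a starting point for the "check your data for problems" step, the sketch below counts, per column, how many non-missing cells fail numeric conversion. The small frame holds a few values taken from the first rows of the dataset; `count_unparseable` is a hypothetical helper name, and this is only one of several reasonable ways to run such a check.

```python
import pandas as pd

# A few values from the first rows of the dataset (three rows, four columns).
sample = pd.DataFrame({
    'teaching': [99.7, 97.7, 90.9],
    'income': ['34.5', '83.7', '-'],
    'international_students': ['25%', '27%', '27%'],
    'female_male_ratio': [None, '33 : 67', '45 : 55'],
})

def count_unparseable(df):
    """Per column: number of non-missing cells that are not valid numbers."""
    report = {}
    for col in df.columns:
        converted = pd.to_numeric(df[col], errors='coerce')
        # Present in the raw data but NaN after conversion: a problem cell.
        report[col] = int((converted.isna() & df[col].notna()).sum())
    return pd.Series(report)

print(count_unparseable(sample))
```

Columns with a non-zero count are the ones that need string cleaning before they can enter a regression.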

---

If you do not know how to do something, **check the documentation first.** I look things up in the documentation all the time.

**You are not expected to know how to do things by heart. Knowing how to effectively look up the answers on the internet is a critical skill for data scientists!**

In [1]:
uni_data_path = './dataset/timesData.csv'

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [3]:
unidata = pd.read_csv(uni_data_path)

In [8]:
unidata.columns

Index([u'world_rank', u'university_name', u'country', u'teaching',
       u'international', u'research', u'citations', u'income', u'total_score',
       u'num_students', u'student_staff_ratio', u'international_students',
       u'female_male_ratio', u'year'],
      dtype='object')

In [27]:
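def cleaner(value):
    """Convert one raw cell to a float, or NaN if it cannot be parsed."""
    try:
        text = str(value).split(' : ')[0]   # '33 : 67' -> '33' (female share)
        text = text.strip().strip('%').replace(',', '')
        return float(text)                  # '-' and '' raise ValueError
    except ValueError:
        return np.nan

# Clean only the columns that store numbers as strings; the text columns
# (university_name, country) would otherwise all become NaN.
messy_cols = ['income', 'num_students', 'international_students', 'female_male_ratio']
unidata_clean = unidata.copy()
for col in messy_cols:
    unidata_clean[col] = unidata_clean[col].map(cleaner)

# dropna returns a new frame, so the result has to be assigned.
unidata_clean = unidata_clean.dropna(axis=1, thresh=3)
unidata.head()  # raw frame, for comparison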

| | world_rank | university_name | country | teaching | international | research | citations | income | total_score | num_students | student_staff_ratio | international_students | female_male_ratio | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Harvard University | United States of America | 99.7 | 72.4 | 98.7 | 98.8 | 34.5 | 96.1 | 20152 | 8.9 | 25% | NaN | 2011 |
| 1 | 2 | California Institute of Technology | United States of America | 97.7 | 54.6 | 98.0 | 99.9 | 83.7 | 96.0 | 2243 | 6.9 | 27% | 33 : 67 | 2011 |
| 2 | 3 | Massachusetts Institute of Technology | United States of America | 97.8 | 82.3 | 91.4 | 99.9 | 87.5 | 95.6 | 11074 | 9.0 | 33% | 37 : 63 | 2011 |
| 3 | 4 | Stanford University | United States of America | 98.3 | 29.5 | 98.1 | 99.2 | 64.3 | 94.3 | 15596 | 7.8 | 22% | 42 : 58 | 2011 |
| 4 | 5 | Princeton University | United States of America | 90.9 | 70.3 | 95.4 | 99.9 | - | 94.2 | 7929 | 8.4 | 27% | 45 : 55 | 2011 |


In [22]:
# unidata['country'].value_counts()

# One-hot encode country over every row (calling .head() here kept only 5 rows).
unidata['country'] = unidata['country'].astype('category')
country_dummy = pd.get_dummies(unidata['country'])
is_us = country_dummy['United States of America'].astype(int)

# Append the indicator column and expose it under a shorter name.
unidata_us = pd.concat([unidata, is_us], axis=1)
unidata_us['is_usa'] = unidata_us['United States of America']
unidata_us.head()

# General pattern for replacing a column with its dummies:
# pd.concat([df.drop('key',axis=1),pd.get_dummies(df['key'])], axis = 1)


| | world_rank | university_name | country | teaching | international | research | citations | income | total_score | num_students | student_staff_ratio | international_students | female_male_ratio | year | United States of America | is_usa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Harvard University | United States of America | 99.7 | 72.4 | 98.7 | 98.8 | 34.5 | 96.1 | 20152 | 8.9 | 25% | NaN | 2011 | 1 | 1 |
| 1 | 2 | California Institute of Technology | United States of America | 97.7 | 54.6 | 98.0 | 99.9 | 83.7 | 96.0 | 2243 | 6.9 | 27% | 33 : 67 | 2011 | 1 | 1 |
| 2 | 3 | Massachusetts Institute of Technology | United States of America | 97.8 | 82.3 | 91.4 | 99.9 | 87.5 | 95.6 | 11074 | 9.0 | 33% | 37 : 63 | 2011 | 1 | 1 |
| 3 | 4 | Stanford University | United States of America | 98.3 | 29.5 | 98.1 | 99.2 | 64.3 | 94.3 | 15596 | 7.8 | 22% | 42 : 58 | 2011 | 1 | 1 |
| 4 | 5 | Princeton University | United States of America | 90.9 | 70.3 | 95.4 | 99.9 | - | 94.2 | 7929 | 8.4 | 27% | 45 : 55 | 2011 | 1 | 1 |
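
If only the US indicator is needed, it can also be built without the intermediate dummy frame by comparing the column directly. The sketch below uses a two-row frame in which the second row is a made-up placeholder, not a row from the dataset.

```python
import pandas as pd

frame = pd.DataFrame({
    'university_name': ['Harvard University', 'Hypothetical University'],
    'country': ['United States of America', 'Hypothetical Country'],
})

# The comparison yields booleans; astype(int) turns them into 0/1.
frame['is_usa'] = (frame['country'] == 'United States of America').astype(int)
print(frame['is_usa'].tolist())
```

The dummy-based route is still the more natural choice when indicators for several countries are wanted at once.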


In [None]:
# Selecting several columns needs a list of names, hence the double brackets.
unidata_sub = unidata[['country', 'teaching', 'international', 'research', 'citations', 'income', 'num_students',
                       'student_staff_ratio', 'international_students', 'female_male_ratio', 'year']]

unidata_target = unidata['total_score']
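
With predictors and target separated, a regression can be fitted. The notebook does not say which library to use, so the sketch below falls back on `np.linalg.lstsq`, only because NumPy is already imported. It fits ordinary least squares on the five rows shown in the preview above, which is far too little data for any real conclusion. The choice of `teaching` and `research` as predictors is an arbitrary assumption for illustration.

```python
import numpy as np
import pandas as pd

# The five preview rows (2011, ranks 1 to 5); illustration only.
rows = pd.DataFrame({
    'teaching':    [99.7, 97.7, 97.8, 98.3, 90.9],
    'research':    [98.7, 98.0, 91.4, 98.1, 95.4],
    'total_score': [96.1, 96.0, 95.6, 94.3, 94.2],
})

predictors = ['teaching', 'research']       # assumed choice of predictors
# Design matrix: a column of ones for the intercept, then the predictors.
X = np.column_stack([np.ones(len(rows)), rows[predictors].to_numpy()])
y = rows['total_score'].to_numpy()

# Least-squares solution of X @ coef ~= y.
coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef
r_squared = 1 - ((y - fitted) ** 2).sum() / ((y - y.mean()) ** 2).sum()

print(dict(zip(['intercept'] + predictors, coef)))
print('R^2:', r_squared)
```

On the full dataset the numeric columns would have to be cleaned first, and rows with a missing `total_score` dropped, before building `X` and `y` in the same way.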

