# Understanding the Labour Market

## About the challenge

**India produces 1.5 million engineers every year. A relevant question is what determines the salaries and jobs these engineers are offered right after graduation. Several factors play a role: college grades, candidate skills, the proximity of the college to industrial hubs, the candidate's specialization, and market conditions in specific industries.**

**Given the profiles of several students with varied backgrounds, use the data to gain insights and answer the following questions:**

* **Predictive Modelling** - Given a new student profile, can we predict the student's **annual salary** from historical data?
* **Recommendation** - Can we identify the key parameters in a student's profile that, if changed, would lead to a **better salary**?
* **Data Insights** - Can we understand which factors in the labour market determine one's salary? Is it just one's skills, or are there other factors that influence returns in the labour market? What signals and biases enter the labour market? Can we build interpretable models or visualize features to understand what determines salary? For instance, do students from smaller towns get lower salaries? This can help us understand inefficiencies in the labour market, which would be extremely useful for policy making and for designing interventions.
* **Visualization** - Finally, can we visualize **where** people get jobs and **what** jobs they get, to build a quick but deeper understanding?

In [1]:
# display inline plots
%matplotlib inline

# import libraries for numerical and scientific computing
import numpy as np

# import matplotlib for plotting
import matplotlib.pyplot as plt

# import pandas for data wrangling and munging
import pandas as pd

# set some options for better view
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)

# import plotting library built on top of matplotlib
import seaborn as sns

# set some settings related to style of plots that will render
sns.set_style("whitegrid")
sns.set_context("poster")

import warnings
warnings.filterwarnings('ignore')



In [35]:
train = pd.read_excel('./data/train.xlsx', parse_dates=True)
test = pd.read_excel('./data/test.xlsx', parse_dates=True)

In [36]:
# set ID as the index
train = train.set_index('ID')
test = test.set_index('ID')

## Data Description

|Input|Description|
|------|------|
|ID  |A unique ID to identify a candidate  |
|Salary  |Annual CTC offered to the candidate (in INR)  |
|DOJ  |Date of joining the company  |
|DOL  |Date of leaving the company  |
|Designation  |Designation offered in the job  |
|JobCity  |City in which the candidate is offered the job  |
|Gender  |Candidate's gender  |
|DOB  |Date of birth of candidate  |
|10percentage  |Overall marks obtained in grade 10 examinations  |
|10board  |The school board whose curriculum the candidate followed in grade 10  |
|12graduation  |Year of graduation - senior year high school  |
|12percentage  |Overall marks obtained in grade 12 examinations  |
|12board  |The school board whose curriculum the candidate followed in grade 12  |
|CollegeID  |Unique ID identifying the university/college which the candidate attended for her/his undergraduate studies  |
|CollegeTier  |Each college has been annotated as 1 or 2. The annotations have been computed from the average AMCAT scores obtained by the students in the college/university. Colleges with an average score above a threshold are tagged as 1 and others as 2.  |
|Degree  |Degree obtained/pursued by the candidate  |
|Specialization  |Specialization pursued by the candidate  |
|CollegeGPA  |Aggregate GPA at graduation  |
|CollegeCityID  |A unique ID to identify the city in which the college is located  |
|CollegeCityTier  |The tier of the city in which the college is located. This is annotated based on the population of the cities.  |
|CollegeState  |Name of the state in which the college is located  |
|GraduationYear  |Year of graduation (Bachelor's degree)  |
|English  |Score in AMCAT's English section  |
|Logical  |Score in AMCAT's Logical ability section  |
|Quant  |Score in AMCAT's Quantitative ability section  |
|Domain  |Score in AMCAT's domain module  |
|ComputerProgramming  |Score in AMCAT's Computer programming section  |
|ElectronicsAndSemicon  |Score in AMCAT's Electronics & Semiconductor Engineering section  |
|ComputerScience  |Score in AMCAT's Computer Science section  |
|MechanicalEngg  |Score in AMCAT's Mechanical Engineering section  |
|ElectricalEngg  |Score in AMCAT's Electrical Engineering section  |
|TelecomEngg  |Score in AMCAT's Telecommunication Engineering section  |
|CivilEngg  |Score in AMCAT's Civil Engineering section  |
|conscientiousness  |Scores in one of the sections of AMCAT's personality test  |
|agreeableness  |Scores in one of the sections of AMCAT's personality test  |
|extraversion  |Scores in one of the sections of AMCAT's personality test  |
|nueroticism  |Scores in one of the sections of AMCAT's personality test  |
|openess_to_experience  |Scores in one of the sections of AMCAT's personality test  |


In [37]:
# preprocess columns to lowercase letters
train.columns = train.columns.map(lambda x: x.lower())
test.columns = test.columns.map(lambda x: x.lower())

In [38]:
train.columns

Index([u'salary', u'doj', u'dol', u'designation', u'jobcity', u'gender', u'dob', u'10percentage', u'10board', u'12graduation', u'12percentage', u'12board', u'collegeid', u'collegetier', u'degree', u'specialization', u'collegegpa', u'collegecityid', u'collegecitytier', u'collegestate', u'graduationyear', u'english', u'logical', u'quant', u'domain', u'computerprogramming', u'electronicsandsemicon', u'computerscience', u'mechanicalengg', u'electricalengg', u'telecomengg', u'civilengg',
       u'conscientiousness', u'agreeableness', u'extraversion', u'nueroticism', u'openess_to_experience'],
      dtype='object')

## Data Exploration

### Question: Which cities provide the most job opportunities for engineers?

In [39]:
# let's check out the jobcity column
train.jobcity.unique()

array([u'Bangalore', u'Indore', u'Chennai', u'Gurgaon', u'Manesar',
       u'Hyderabad', u'Banglore', u'Noida', u'Kolkata', u'Pune', -1,
       u'mohali', u'Jhansi', u'Delhi', u'Hyderabad ', u'Bangalore ',
       u'noida', u'delhi', u'Bhubaneswar', u'Navi Mumbai', u'Mumbai',
       u'New Delhi', u'Mangalore', u'Rewari', u'Gaziabaad', u'Bhiwadi',
       u'Mysore', u'Rajkot', u'Greater Noida', u'Jaipur', u'noida ',
       u'HYDERABAD', u'mysore', u'THANE', u'Maharajganj',
       u'Thiruvananthapuram', u'Punchkula', u'Bhubaneshwar', u'Pune ',
       u'coimbatore', u'Dhanbad', u'Lucknow', u'Trivandrum', u'kolkata',
       u'mumbai', u'Gandhi Nagar', u'Una', u'Daman and Diu', u'chennai',
       u'GURGOAN', u'vsakhapttnam', u'pune', u'Nagpur', u'Bhagalpur',
       u'new delhi - jaisalmer', u'Coimbatore', u'Ahmedabad',
       u'Kochi/Cochin', u'Bankura', u'Bengaluru', u'Mysore ', u'Kanpur ',
       u'jaipur', u'Gurgaon ', u'bangalore', u'CHENNAI', u'Vijayawada',
       u'Kochi', u'Beawar', u'

**Some missing values are marked as -1. We first convert the column to strings, then normalize the city names so that Noida, NOIDA and noida all count as the same city.**

In [40]:
train['jobcity'] = train.jobcity.astype('str')
test['jobcity'] = test.jobcity.astype('str')

In [42]:
def preprocess_city_name(city_name):
    city_name = city_name.strip()
    city_name = city_name.lower()
    return city_name

train['jobcity'] = train.jobcity.map(preprocess_city_name)
test['jobcity'] = test.jobcity.map(preprocess_city_name)
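
Stripping whitespace and lowercasing does not merge spelling variants such as `Banglore`, `Bengaluru` and `GURGOAN`, which all appear in the `unique()` output above. A minimal sketch of one way to handle them is an alias lookup. The `CITY_ALIASES` table and the `canonical_city` helper are hypothetical names introduced for illustration, the canonical spellings are an assumption, and the sample values are a hand-picked subset rather than the real column.

```python
import numpy as np
import pandas as pd

# Hypothetical alias table: keys are spellings seen in the unique() output,
# values are the canonical names chosen for this sketch (an assumption).
CITY_ALIASES = {
    "banglore": "bangalore",
    "bengaluru": "bangalore",
    "gurgoan": "gurgaon",
    "gaziabaad": "ghaziabad",
    "bhubaneshwar": "bhubaneswar",
    "trivandrum": "thiruvananthapuram",
    "kochi/cochin": "kochi",
}

def canonical_city(value):
    """Return a cleaned city name, or NaN for the -1 missing-value marker."""
    name = str(value).strip().lower()
    if name == "-1":
        return np.nan
    # Fall back to the cleaned name when no alias is registered.
    return CITY_ALIASES.get(name, name)

# Toy values copied from the unique() output, not the full column.
sample = pd.Series(["Bangalore ", "Banglore", "Bengaluru", -1, "GURGOAN", "noida "])
cleaned = sample.map(canonical_city)
print(cleaned.value_counts(dropna=False))
```

Mapping -1 to NaN instead of the string `'-1'` is a design choice: pandas then excludes those rows from `value_counts()` and `groupby()` by default, so the missing marker no longer competes with real cities.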

In [44]:
# now let's find out the count distribution for various cities
train.jobcity.value_counts()[:10]

bangalore    665
-1           461
noida        389
hyderabad    368
pune         327
chennai      313
gurgaon      217
new delhi    204
mumbai       119
kolkata      119
Name: jobcity, dtype: int64

**Bangalore leads the way, followed by Noida, Hyderabad and Pune. The second most common value, -1, marks the 461 records in which no job city was given.**
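
Because the missing-value marker is stored as the string `'-1'` after the conversion above, it is ranked like any other city. A minimal sketch, on made-up values, of how the share of missing entries could be reported separately from the ranking of known cities (the names `toy_cities`, `missing_share` and `top_known` are illustrative):

```python
import pandas as pd

# Made-up values for illustration; '-1' stands for a missing job city.
toy_cities = pd.Series(["bangalore", "-1", "noida", "bangalore", "-1", "pune"])

# Fraction of records without a job city.
missing_share = (toy_cities == "-1").mean()

# Ranking restricted to records where the city is known.
top_known = toy_cities[toy_cities != "-1"].value_counts()

print("missing share: {:.1%}".format(missing_share))
print(top_known.head(10))
```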

### Question: What are the mean salaries in various cities?

In [48]:
# group data by city
group_by_city = train.groupby('jobcity')['salary']
# Series.order() was removed from pandas; sort_values() is its replacement
group_by_city.mean().sort_values(ascending=False)[:10]

jobcity
kalmar, sweden    2300000
london            2000000
johannesburg      1745000
angul             1300000
maharajganj       1200000
dubai             1140000
muzaffarpur       1000000
durgapur           850000
dammam             775000
panchkula          733333
Name: salary, dtype: int64

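**The top of this list is dominated by overseas locations (Kalmar, London, Johannesburg, Dubai, Dammam), which suggests that engineers placed abroad earn considerably more. Smaller Indian towns such as Angul and Maharajganj also appear, most likely because they have only a few records each, so these means should be read with caution.**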
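
A per-city mean is only as trustworthy as the number of records behind it. One way to check this, sketched here on made-up data, is to compute the record count next to the mean and keep only cities above a minimum count. `MIN_RECORDS` is a hypothetical threshold that would need tuning on the real data.

```python
import pandas as pd

# Made-up records for illustration; column names follow the notebook.
toy = pd.DataFrame({
    "jobcity": ["london", "bangalore", "bangalore", "bangalore", "noida", "noida"],
    "salary": [2000000, 300000, 400000, 500000, 250000, 350000],
})

MIN_RECORDS = 2  # hypothetical threshold, not taken from the source

# Mean salary and number of records per city, side by side.
summary = toy.groupby("jobcity")["salary"].agg(["mean", "count"])

# Keep only cities with enough records, then rank by mean salary.
reliable = summary[summary["count"] >= MIN_RECORDS].sort_values("mean", ascending=False)
print(reliable)
```

In this toy example the single London record would top an unfiltered ranking, but it drops out once the count filter is applied.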

### Question: What are the median salaries in various cities?

In [49]:
# group data by city
group_by_city = train.groupby('jobcity')['salary']
# Series.order() was removed from pandas; sort_values() is its replacement
group_by_city.median().sort_values(ascending=False)[:10]

jobcity
kalmar, sweden    2300000
london            2000000
johannesburg      1745000
dubai             1300000
angul             1300000
maharajganj       1200000
muzaffarpur       1000000
durgapur           850000
dammam             775000
rajpura            700000
Name: salary, dtype: int64
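
The two rankings are easier to compare when mean and median sit in the same table. In the outputs above, for example, Dubai has a mean of 1,140,000 but a median of 1,300,000, which indicates that at least one lower salary pulls the mean down. The sketch below uses made-up data and generic city labels, and the `mean_minus_median` column is an illustrative name.

```python
import pandas as pd

# Made-up records for illustration only.
toy = pd.DataFrame({
    "jobcity": ["city_a", "city_a", "city_a", "city_b", "city_b", "city_b"],
    "salary": [500000, 1300000, 1500000, 300000, 320000, 1000000],
})

# Mean, median and record count per city in one table.
stats = toy.groupby("jobcity")["salary"].agg(["mean", "median", "count"])

# A negative gap means low outliers pull the mean down; a positive gap means high ones pull it up.
stats["mean_minus_median"] = stats["mean"] - stats["median"]
print(stats.sort_values("median", ascending=False))
```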