# SALARY PREDICTOR

## In this report we use the data in salary.csv to predict employee salaries from employee characteristics (features), using statistical methods and exploratory data analysis (EDA).

### In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. (Source: https://medium.com/python-pandemonium/introduction-to-exploratory-data-analysis-in-python-8b6bcb55c190)

The packages below are imported into this notebook to load, analyze, and visualize the data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import csv
%matplotlib inline

The data in salary.csv is loaded into a pandas DataFrame with pd.read_csv() and stored under the name data.

In [2]:
data = pd.read_csv("salary.csv")
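
If salary.csv is not at hand, the loading step can be tried on an in-memory copy of a few rows. The sketch below assumes the file is plain comma-separated text with a header row; the five rows are the ones printed by data.head() in this report, and the names `sample_csv` and `sample` are illustrative only.

```python
import io

import pandas as pd

# First five rows of the dataset, copied from the data.head() output in this report.
# Assumption: the real file is comma-separated with a header row.
sample_csv = """salary,exprior,yearsworked,yearsrank,market,degree,otherqual,position,male,Field,yearsabs
53000.0,0,0,0,1.17,1,0,1,1,3,0
58000.0,1,0,0,1.24,1,0,1,1,2,0
45500.0,0,0,0,1.21,1,0,1,1,3,2
35782.0,0,2,1,0.99,1,0,1,1,4,1
34731.0,0,2,2,0.91,1,0,1,1,4,1
"""

# pd.read_csv accepts a path or any file-like object, so StringIO stands in for the file.
sample = pd.read_csv(io.StringIO(sample_csv))

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # column types inferred by pandas
```

Checking the shape and dtypes right after loading is a cheap way to confirm that the header row and the numeric columns were parsed as expected.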

The next few cells inspect the dataset's size and basic properties, such as the number of rows and columns and the range of values in each column.

We use the pandas .head() and .tail() methods to display the first and last five rows of the dataset, with all columns.

In [3]:
data.head()

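| | salary | exprior | yearsworked | yearsrank | market | degree | otherqual | position | male | Field | yearsabs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 53000.0 | 0 | 0 | 0 | 1.17 | 1 | 0 | 1 | 1 | 3 | 0 |
| 1 | 58000.0 | 1 | 0 | 0 | 1.24 | 1 | 0 | 1 | 1 | 2 | 0 |
| 2 | 45500.0 | 0 | 0 | 0 | 1.21 | 1 | 0 | 1 | 1 | 3 | 2 |
| 3 | 35782.0 | 0 | 2 | 1 | 0.99 | 1 | 0 | 1 | 1 | 4 | 1 |
| 4 | 34731.0 | 0 | 2 | 2 | 0.91 | 1 | 0 | 1 | 1 | 4 | 1 |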


In [4]:
data.tail()

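| | salary | exprior | yearsworked | yearsrank | market | degree | otherqual | position | male | Field | yearsabs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 509 | 45906.0 | 6 | 17 | 11 | 0.92 | 1 | 0 | 2 | 0 | 2 | 70 |
| 510 | 60402.0 | 6 | 19 | 7 | 0.86 | 1 | 0 | 3 | 0 | 1 | 72 |
| 511 | 53187.0 | 19 | 7 | 6 | 0.78 | 1 | 0 | 3 | 0 | 4 | 76 |
| 512 | 56542.0 | 8 | 20 | 10 | 0.78 | 1 | 0 | 3 | 0 | 2 | 78 |
| 513 | 52662.0 | 13 | 25 | 11 | 0.78 | 1 | 0 | 3 | 0 | 1 | 112 |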


Below we use the pandas DataFrame.describe() method to view basic summary statistics for each column: count, mean, standard deviation, minimum, maximum, and the 25th, 50th, and 75th percentiles.

In [5]:
data.describe()

| | salary | exprior | yearsworked | yearsrank | market | degree | otherqual | position | male | Field | yearsabs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 513.0 | 514.0 | 514.0 | 514.0 | 514.0 | 514.0 | 514.0 | 514.0 | 514.0 | 514.0 | 514.0 |
| mean | 50863.220098 | 2.92607 | 12.85214 | 7.052529 | 0.948521 | 0.964981 | 0.044747 | 2.132296 | 0.750973 | 2.529183 | 6.98249 |
| std | 12685.132358 | 4.791397 | 9.444695 | 6.414771 | 0.14938 | 0.184008 | 0.20695 | 0.820075 | 0.432871 | 1.12742 | 16.873156 |
| min | 29000.0 | 0.0 | 0.0 | 0.0 | 0.71 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 25% | 40000.0 | 0.0 | 4.0 | 2.0 | 0.84 | 1.0 | 0.0 | 1.0 | 1.0 | 2.0 | 0.0 |
| 50% | 50096.0 | 0.5 | 12.0 | 5.0 | 0.92 | 1.0 | 0.0 | 2.0 | 1.0 | 3.0 | 1.0 |
| 75% | 60345.0 | 4.0 | 22.0 | 12.0 | 1.02 | 1.0 | 0.0 | 3.0 | 1.0 | 4.0 | 2.0 |
| max | 96156.0 | 25.0 | 41.0 | 28.0 | 1.33 | 1.0 | 1.0 | 3.0 | 1.0 | 4.0 | 118.0 |


# QUESTIONS 

## QUESTION 1.

How many responders are there? Are there any missing values in any of the variables?

To answer "How many responders are there?", we check the .shape attribute of the data DataFrame (data.shape). It shows 514 rows, and since each row is one responder, there are 514 responders in the dataset.

In [6]:
data.shape

(514, 11)
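
As a small illustration of reading the row count, the sketch below unpacks .shape on a hypothetical three-row frame (the name `toy` and its contents are stand-ins, not the real dataset). It assumes, as above, that each row corresponds to one responder.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the salary data.
toy = pd.DataFrame({"salary": [53000.0, 58000.0, 45500.0], "yearsworked": [0, 0, 0]})

n_rows, n_cols = toy.shape  # shape is an attribute, so it takes no parentheses
print(n_rows, n_cols)
print(len(toy))             # len() of a DataFrame is also its row count
```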

To answer "Are there any missing values in any of the variables?", we use data.isnull().values.any(), which returns True, meaning that at least one value is missing. We then use data.isnull().sum() to count the missing values in each column. We find that salary is the only column with a missing value, and it has exactly one.

In [7]:
data.isnull().values.any()

True

In [8]:
data.isnull().sum()

salary         1
exprior        0
yearsworked    0
yearsrank      0
market         0
degree         0
otherqual      0
position       0
male           0
Field          0
yearsabs       0
dtype: int64
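
Once a missing value is known to exist, it is often useful to see which row it sits in and to decide how to handle it. The sketch below uses a hypothetical toy frame with one missing salary; dropping the incomplete row is shown as one possible option, not as the approach taken in this report.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing salary, mirroring the situation in the real data.
toy = pd.DataFrame({
    "salary": [53000.0, np.nan, 45500.0],
    "yearsworked": [0, 0, 2],
})

missing_per_column = toy.isnull().sum()             # count of NaN per column
rows_with_missing = toy[toy.isnull().any(axis=1)]   # rows containing at least one NaN
complete_rows = toy.dropna(subset=["salary"])       # one option: drop rows lacking a salary

print(missing_per_column)
print(rows_with_missing)
print(len(complete_rows))
```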

## QUESTION 2.

What is the lowest salary and highest salary in the group?

To find the highest salary (96156.0), we use the .max() method on the salary column of the data DataFrame. To find the lowest salary (29000.0), we use the .min() method on the same column.

In [11]:
data['salary'].max()

96156.0

In [12]:
data['salary'].min()

29000.0
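
Both extremes can also be requested in one call, and .idxmax() can recover the full record behind the maximum. This is a minimal sketch on the five rows printed by data.head() in this report, so its results describe only those rows and not the whole dataset; the names `toy`, `salary_range`, and `top_row` are illustrative.

```python
import pandas as pd

# Salary and yearsworked values from the five rows shown by data.head() in this report.
toy = pd.DataFrame({
    "salary": [53000.0, 58000.0, 45500.0, 35782.0, 34731.0],
    "yearsworked": [0, 0, 0, 2, 2],
})

salary_range = toy["salary"].agg(["min", "max"])  # both extremes in one call
top_row = toy.loc[toy["salary"].idxmax()]         # full record of the highest earner

print(salary_range)
print(top_row)
```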

## QUESTION 3. 

What is the mean salary for the sample? Include the standard error of the mean.

To get the mean salary, we use the pandas .mean() method on the salary column, which gives a mean of 50863.22.

In [21]:
data['salary'].mean()

50863.22009783626

To get the standard error of the mean (SEM), we use the pandas .sem() method, which returns the unbiased standard error of the mean over the requested axis. The SEM of the salary column is 560.06.

In [22]:
data['salary'].sem(axis=0, skipna=True)

560.0622753925232
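
The SEM is the sample standard deviation divided by the square root of the number of non-missing observations. As a quick cross-check, the sketch below recomputes it from the figures in the data.describe() output of this report; the variable names are illustrative.

```python
import math

# Figures taken from the data.describe() output in this report.
salary_std = 12685.132358  # sample standard deviation of salary
salary_count = 513         # non-missing salary values (514 rows, 1 missing)

sem = salary_std / math.sqrt(salary_count)
print(round(sem, 2))  # 560.06
```

The result agrees with the value returned by .sem(), which skips the missing salary by default.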

## QUESTION 4. 

What is the standard deviation for the years worked?

To get the standard deviation of the yearsworked column, we use the pandas .std() method (data['yearsworked'].std()). The standard deviation of yearsworked is 9.44.

In [24]:
data['yearsworked'].std()

9.444695144169813
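
One detail worth keeping in mind is that pandas .std() computes the sample standard deviation (dividing by n - 1), whereas NumPy's np.std divides by n unless told otherwise. The sketch below shows the difference on the yearsworked values from the five rows printed by data.head(); the numbers describe only those rows.

```python
import numpy as np
import pandas as pd

# yearsworked values from the five rows shown by data.head() in this report.
years = pd.Series([0, 0, 0, 2, 2])

sample_std = years.std()                   # pandas default: ddof=1 (divide by n - 1)
population_std = np.std(years.to_numpy())  # NumPy default: ddof=0 (divide by n)

print(sample_std, population_std)
# The two libraries agree once the same ddof is used.
print(np.isclose(sample_std, np.std(years.to_numpy(), ddof=1)))
```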

## QUESTION 5.

What is the median salary for the sample?

To get the median of the salary column, we use the pandas .median() method (data['salary'].median()). The median salary is 50096.0.

In [26]:
data['salary'].median()

50096.0
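
The median is the 50th percentile, which is why it matches the 50% row of the data.describe() output. The sketch below illustrates this on the five salaries printed by data.head() in this report, so the values describe only those rows; the name `salaries` is illustrative.

```python
import pandas as pd

# Salaries from the five rows shown by data.head() in this report.
salaries = pd.Series([53000.0, 58000.0, 45500.0, 35782.0, 34731.0])

median_salary = salaries.median()
print(median_salary)                            # middle value once the salaries are sorted
print(salaries.quantile(0.5) == median_salary)  # the median is the 50th percentile
print(salaries.mean())                          # the mean, for comparison
```

Like .mean(), .median() skips missing values by default, so the one missing salary does not need to be removed first.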

## QUESTION 6.