## Can we predict the salary of developers who use more than one programming language?
Let's see if we can predict a person's salary based on the languages they use.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

df = pd.read_csv('data/2020.csv')

In [2]:
# We'll select only the salary and languages people use
df_modified = df[['ConvertedComp', 'LanguageWorkedWith']]
# Drop all the NaNs (missing language or salary data) so we can first check if this model works at all
df_modified = df_modified.dropna()
df_modified.head()

Unnamed: 0,ConvertedComp,LanguageWorkedWith
7,116000.0,Python;SQL
9,32315.0,HTML/CSS;Java;JavaScript;Python;SQL
10,40070.0,C#;JavaScript;Swift
11,14268.0,HTML/CSS;JavaScript
12,38916.0,C;JavaScript;Python


In [3]:
def split_column(column: pd.Series, separator: str = ';') -> pd.DataFrame:
    """
    INPUT:
    column - pandas Series with multiple values per cell, separated by a delimiter
    separator - separator between the values in the cells

    OUTPUT:
    pandas DataFrame with one column per unique value; each entry is 1 if the row's cell contains that value, otherwise 0
    """
    return column.str.get_dummies(sep=separator)

# Split the categorical variable language into columns and separate the salary
x = split_column(df_modified['LanguageWorkedWith'])
salary_language_df = pd.concat([df_modified, x], axis=1).drop(['LanguageWorkedWith'], axis=1)
y = df_modified['ConvertedComp']
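
To make the splitting step concrete, here is a minimal sketch of what `str.get_dummies` does to a semicolon-separated column. The three responses in `toy` are made up for illustration and are not taken from the survey data.

```python
import pandas as pd

# Made-up responses in the same "a;b;c" format as LanguageWorkedWith
toy = pd.Series(['Python;SQL', 'C;Python', 'SQL'])

# One 0/1 indicator column per distinct language
indicators = toy.str.get_dummies(sep=';')
print(indicators)
```

Each row gets a 1 in the column of every language it lists and a 0 elsewhere, which is the form the regression below expects as input.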

In [4]:
# Split the data into train and test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=16)
model = LinearRegression()
model.fit(x_train, y_train)
y_predicted_train = model.predict(x_train)
y_predicted_test = model.predict(x_test)
# Evaluate the model using r2_score
print('r2 score for the train data', r2_score(y_train, y_predicted_train))
print('r2 score for the test data', r2_score(y_test, y_predicted_test))

r2 score for the train data 0.015749548071119612
r2 score for the test data 0.01361132262454745


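Judging by the r2 scores (about 0.016 on the training data and 0.014 on the test data), the model explains less than 2% of the variance in salary. With simple linear regression, we cannot accurately predict a developer's salary based only on the languages they use.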
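
For context on how to read these numbers, the sketch below computes r2 by hand as 1 - SS_res / SS_tot and shows that always predicting the mean salary gives a score of 0. The salaries in `y_true` are made-up values, used only to illustrate the formula. A score close to 0, like the ones above, therefore means the model is barely better than that constant guess.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up salaries, for illustration only
y_true = np.array([30000.0, 50000.0, 70000.0, 90000.0])

# Baseline "model": always predict the mean salary
mean_prediction = np.full_like(y_true, y_true.mean())

# r2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - mean_prediction) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print('manual r2 for the mean baseline', manual_r2)
print('sklearn r2 for the mean baseline', r2_score(y_true, mean_prediction))
```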