# Business Understanding
#### The purpose of this exercise is to find trends and useful insights in the Stack Overflow survey data by applying the knowledge acquired in the first section of the Udacity Data Science Nanodegree.

# Data Understanding
### The survey data consists of a set of questions answered by a plethora of individuals who used or visited the Stack Overflow website in 2017, 2018 and 2019. Notably, the wording of the question/answer column headers sometimes varied between years, so some cleaning was needed before aggregation operations could be applied.
### After exploring and understanding the data, the following questions were formulated for this exercise:
### 1. What will be the most popular level of formal education held by individuals in 2020?
### 2. What will be the most desired language to work with?
### 3. Which country will have the most users in 2020?

### The required libraries for this example are pandas, numpy and scikit-learn (imported as `sklearn`). These can be installed via `pip install <library>`.
### matplotlib is optional, for anyone who would like to visualize the data and the findings. I will not use it, since I am solely interested in the predicted values.

In [20]:
import pandas as pd 
import numpy as np
from sklearn.linear_model import LinearRegression
# import matplotlib.pyplot as plt

# Reading data and cleaning

In [21]:
data2017 = pd.read_csv('project1_data/developer_survey_2017/survey_results_public.csv')
data2017['SurveyYear'] = 2017
data2017.shape[1]
data2017 = data2017.rename(columns={'WantWorkLanguage':'LanguageDesireNextYear'})
# Line above is an example of the cleaning done to have desired columns named the same, in order to use a groupby aggregation.
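
The renaming step generalizes to a per-year mapping of column names. The sketch below is illustrative only: the toy frames, the `COLUMN_ALIASES` dictionary and the `harmonize` helper are assumptions made up for the example. Only the two renames themselves (`WantWorkLanguage` and `EdLevel`) come from this notebook.

```python
import pandas as pd

# Hypothetical alias table: year -> {original column name: harmonized name}.
# Only these two renames are taken from the notebook; the structure is illustrative.
COLUMN_ALIASES = {
    2017: {"WantWorkLanguage": "LanguageDesireNextYear"},
    2019: {"EdLevel": "FormalEducation"},
}

def harmonize(df, year):
    """Return a renamed copy of df with a SurveyYear column added."""
    out = df.rename(columns=COLUMN_ALIASES.get(year, {}))
    out["SurveyYear"] = year
    return out

# Toy stand-ins for the real CSV files (made-up values).
toy2017 = pd.DataFrame({"Respondent": [1], "WantWorkLanguage": ["Python; C"]})
toy2019 = pd.DataFrame({"Respondent": [1], "EdLevel": ["Associate degree"]})

h2017 = harmonize(toy2017, 2017)
h2019 = harmonize(toy2019, 2019)
print(list(h2017.columns))
print(list(h2019.columns))
```

Keeping the aliases in one place makes it easier to see which columns were treated as equivalent across survey years.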

In [22]:
data2018 = pd.read_csv('project1_data/developer_survey_2018/survey_results_public.csv')
data2018['SurveyYear'] = 2018
data2018.shape[1]


130

In [23]:
data2019 = pd.read_csv('project1_data/developer_survey_2019/survey_results_public.csv')
data2019['SurveyYear'] = 2019
data2019 = data2019.rename(columns={'EdLevel':'FormalEducation'})

In [24]:
allData_complete = pd.concat([data2017,data2018,data2019], axis=0, ignore_index=True,sort=False)
allData_complete = allData_complete.reset_index(drop=True)
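
Since the yearly files do not share all of their columns, it helps to see what `pd.concat` does with mismatched frames. A minimal sketch with made-up toy frames (the values are not survey data):

```python
import pandas as pd

# Toy frames with partly different columns (made-up values).
a = pd.DataFrame({"Respondent": [1, 2], "Country": ["Canada", "India"], "SurveyYear": 2017})
b = pd.DataFrame({"Respondent": [1], "FormalEducation": ["Associate degree"], "SurveyYear": 2019})

# The result has the union of the columns; cells with no source value become NaN.
combined = pd.concat([a, b], axis=0, ignore_index=True, sort=False)
print(combined)
```

Rows from a year that lacks a column end up with missing values in it, which is why the later cells drop missing values per question. With `ignore_index=True` the result already has a fresh 0..n-1 index.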

In [25]:
pd.DataFrame(allData_complete.groupby(['FormalEducation','SurveyYear']).count()['Respondent'])

FormalEducation,SurveyYear,Respondent
Associate degree,2018,2970
Associate degree,2019,2938
Bachelor's degree,2017,21609
"Bachelor’s degree (BA, BS, B.Eng., etc.)",2018,43659
"Bachelor’s degree (BA, BS, B.Eng., etc.)",2019,39134
Doctoral degree,2017,1308
I never completed any formal education,2017,426
I never completed any formal education,2018,700
I never completed any formal education,2019,553
I prefer not to answer,2017,1109


# Cleaning data for Formal Education prediction
### Prepare Data
### A `.dropna()` call is used below, since we are only interested in individuals who answered the question of interest. Imputing values could introduce false positives or assumptions that I would like to rule out. The option not to provide an answer was also available.

In [26]:
data_FormalEducation = allData_complete[['Respondent','FormalEducation','SurveyYear']]
data_FormalEducation = data_FormalEducation.dropna(axis=0)
data_FormalEducation = data_FormalEducation.reset_index(drop=False)

myRegex = ["(?i)bachelor.*","(?i).*master.*","(?i).*doctoral.*","(?i).*some.*","(?i).*secondary.*","(?i).*professional.*"]
myValues = ["Bachelor's degree","Master’s degree","Doctoral degree","Some college/university study without earning a degree","Secondary school","Professional degree"]

data_FormalEducation = data_FormalEducation.replace(to_replace=myRegex, value=myValues, regex=True)    
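
To make the effect of the regex replacement concrete, here is a small sketch on a toy Series. The first two labels and "Associate degree" appear in the grouped output above; the Master's label is a hypothetical long-form variant added for illustration.

```python
import pandas as pd

labels = pd.Series([
    "Bachelor's degree",
    "Bachelor’s degree (BA, BS, B.Eng., etc.)",
    "Master’s degree (MA, MS, M.Eng., MBA, etc.)",  # hypothetical long-form label
    "Associate degree",
])

# Each pattern is paired with the canonical label at the same position.
patterns = ["(?i)bachelor.*", "(?i).*master.*"]
canonical = ["Bachelor's degree", "Master’s degree"]

cleaned = labels.replace(to_replace=patterns, value=canonical, regex=True)
print(cleaned.tolist())
```

Labels that match no pattern, such as "Associate degree", pass through unchanged.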

In [27]:
FormalEd_grouped = pd.DataFrame(data_FormalEducation.groupby(['FormalEducation','SurveyYear']).count()['Respondent'])
FormalEd_grouped

FormalEducation,SurveyYear,Respondent
Associate degree,2018,2970
Associate degree,2019,2938
Bachelor's degree,2017,21609
Bachelor's degree,2018,43659
Bachelor's degree,2019,39134
Doctoral degree,2017,1308
Doctoral degree,2018,2214
Doctoral degree,2019,2432
I never completed any formal education,2017,426
I never completed any formal education,2018,700


In [28]:
eds = FormalEd_grouped.index.get_level_values("FormalEducation").unique()
educationResults = pd.DataFrame(columns=['FormalEducation','2020Forecast'])
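
The grouped frame has a two-level index (category, survey year). The sketch below rebuilds a small piece of it from counts in the table above and shows what `get_level_values` and `xs` return; the frame is constructed by hand here purely for illustration.

```python
import pandas as pd

# Counts copied from the grouped table above.
rows = [
    ("Associate degree", 2018, 2970),
    ("Associate degree", 2019, 2938),
    ("Doctoral degree", 2017, 1308),
    ("Doctoral degree", 2018, 2214),
    ("Doctoral degree", 2019, 2432),
]
grouped = (
    pd.DataFrame(rows, columns=["FormalEducation", "SurveyYear", "Respondent"])
    .set_index(["FormalEducation", "SurveyYear"])
)

# Unique labels of the outer index level, in order of appearance.
levels = grouped.index.get_level_values("FormalEducation").unique()

# Cross-section for one label: the outer level is dropped, SurveyYear remains as index.
doctoral = grouped.xs("Doctoral degree")

print(list(levels))
print(doctoral.index.tolist())
print(doctoral["Respondent"].tolist())
```

Note that a category may have fewer than three years of data, as "Associate degree" does here.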

# Loop through education levels to make predictions for year 2020
### Here is where the data modeling occurs. A function named `predictionFunction` takes the objects common to each prediction and fills a results data frame with the predictions for each category.

In [29]:
def predictionFunction(resultsDF, groupedDF, indexList, predictionCategory):
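    '''
    Fits one linear regression per category and predicts its value for year 2020.
    resultsDF: empty dataframe with columns [predictionCategory, '2020Forecast']
    groupedDF: counts indexed by (category, SurveyYear)
    indexList: category labels to loop over
    output: none; resultsDF is filled in place with one row per category
    '''
    results_ind = 0
    for ind in indexList:
        X = groupedDF.xs(ind).index.values.reshape(-1,1) # survey years available for this category
        Y = groupedDF.loc[(ind,), ].values.reshape(-1, 1) # respondent counts as a numpy column vector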
        linear_regressor = LinearRegression(n_jobs=-1)
        linear_regressor.fit(X, Y)
        Y_pred = linear_regressor.predict([[2020]])
        
        resultsDF.loc[results_ind, predictionCategory] = ind
        resultsDF.loc[results_ind, '2020Forecast'] = Y_pred[0][0]
        results_ind += 1

predictionFunction(educationResults,FormalEd_grouped,eds,'FormalEducation')
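
For a single category, the fit reduces to an ordinary least-squares line through at most three (year, count) points. The sketch below re-derives forecasts by hand with the closed-form slope and intercept instead of scikit-learn; the helper name `forecast` is hypothetical, and the counts are taken from the grouped table above.

```python
def forecast(years, counts, target_year):
    """Least-squares line through (year, count) points, evaluated at target_year."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(counts) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    if sxx == 0:
        # Only one observed year: there is no trend to fit, so return the mean.
        return mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, counts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * target_year

# Bachelor's degree counts for 2017, 2018 and 2019.
bachelors = forecast([2017, 2018, 2019], [21609, 43659, 39134], 2020)
print(round(bachelors, 1))  # 52325.7

# Associate degree has only two years of data.
associate = forecast([2018, 2019], [2970, 2938], 2020)
print(round(associate, 1))  # 2906.0
```

With only two years the line passes exactly through both points, and with a single year the forecast is simply that year's count, so categories with sparse history are extrapolated on very little evidence.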

## Conclusion to question 1
### Evaluate the Results
#### The forecast suggests that a Bachelor's degree will be the most common level of formal education among Stack Overflow users in 2020

In [30]:
educationResults.sort_values('2020Forecast', ascending=False)

Unnamed: 0,FormalEducation,2020Forecast
1,Bachelor's degree,52325.7
5,Master’s degree,25796.7
9,Some college/university study without earning ...,12486.7
8,Secondary school,10567.7
2,Doctoral degree,3108.67
0,Associate degree,2906.0
6,Primary/elementary school,1750.0
7,Professional degree,1603.0
4,I prefer not to answer,1109.0
3,I never completed any formal education,686.667


# Cleaning data for desired language prediction
### Prepare Data
#### *For the purpose of this example, the first language selected is assumed to be the one the individual desires most
### A `.dropna()` call is used below, since we are only interested in individuals who answered the question of interest. Imputing values could introduce false positives or assumptions that I would like to rule out.

In [31]:
dataLanguage = allData_complete[['Respondent','LanguageDesireNextYear','SurveyYear']]
dataLanguage = dataLanguage.dropna(axis=0)
dataLanguage = dataLanguage.reset_index(drop=False)
dataLanguage.LanguageDesireNextYear = dataLanguage.LanguageDesireNextYear.apply(lambda x: x.split(";")[0])
dataLanguage.LanguageDesireNextYear = dataLanguage.LanguageDesireNextYear.apply(lambda x: x.split("/")[0])


Language_grouped = pd.DataFrame(dataLanguage.groupby(['LanguageDesireNextYear','SurveyYear']).count()['Respondent'])
Language_grouped

LanguageDesireNextYear,SurveyYear,Respondent
Assembly,2017,1923
Assembly,2018,4165
Assembly,2019,4659
Bash,2018,42
Bash,2019,19041
...,...,...
VBA,2018,27
VBA,2019,24
Visual Basic 6,2017,10
Visual Basic 6,2018,6
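
The two `split` calls in the cleaning cell keep only the first listed language and then only the part before the first slash. A minimal illustration in plain Python; the helper name `first_language` and the sample answers are assumptions used only for this example.

```python
def first_language(answer):
    """Keep the first ';'-separated entry, then the text before the first '/'."""
    return answer.split(";")[0].split("/")[0]

samples = [
    "Bash/Shell/PowerShell;C;Python",  # hypothetical multi-select answer
    "C#;JavaScript",
    "Python",
]
print([first_language(s) for s in samples])  # ['Bash', 'C#', 'Python']
```

Because every answer is truncated to its first entry, the count for a language depends on where it appears in each respondent's list, so labels that tend to be listed first can be over-represented.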


In [32]:
langs = Language_grouped.index.get_level_values("LanguageDesireNextYear").unique()
langsResults = pd.DataFrame(columns=['LanguageDesireNextYear','2020Forecast'])

In [33]:
# Calling prediction function defined above

predictionFunction(langsResults,Language_grouped,langs,'LanguageDesireNextYear')

## Conclusion to question 2
### Evaluate the Results
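#### The forecast ranks Bash as the most desired language for 2020. Note that this figure is extrapolated from only two data points (42 responses in 2018 and 19,041 in 2019), so it should be read with caution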

In [34]:
langsResults.sort_values('2020Forecast', ascending=False)

Unnamed: 0,LanguageDesireNextYear,2020Forecast
1,Bash,38040.0
17,HTML,21528.0
3,C#,16153.3
4,C++,8460.33
15,Go,8281.67
2,C,6704.33
20,Java,6516.33
0,Assembly,6318.33
31,Python,4575.67
21,JavaScript,4479.0


# Cleaning data for country with the most users prediction
### Prepare Data
### A `.dropna()` call is used below, since we are only interested in individuals who answered the question of interest. Imputing values could introduce false positives or assumptions that I would like to rule out.

In [35]:
dataCountry = allData_complete[['Respondent','Country','SurveyYear']]
dataCountry = dataCountry.dropna(axis=0)
dataCountry = dataCountry.reset_index(drop=False)

Country_grouped = pd.DataFrame(dataCountry.groupby(['Country','SurveyYear']).count()['Respondent'])
Country_grouped

Country,SurveyYear,Respondent
Afghanistan,2017,60
Afghanistan,2018,64
Afghanistan,2019,44
Aland Islands,2017,22
Albania,2017,76
...,...,...
Zambia,2018,9
Zambia,2019,12
Zimbabwe,2017,20
Zimbabwe,2018,39


In [36]:
countries = Country_grouped.index.get_level_values("Country").unique()
countryResults = pd.DataFrame(columns=['Country','2020Forecast'])

In [37]:
# Calling prediction function defined above

predictionFunction(countryResults,Country_grouped,countries,'Country')

## Conclusion to question 3
### Evaluate the Results
#### The forecast suggests that the highest number of users in 2020 will be from the United States

In [38]:
countryResults.sort_values("2020Forecast", ascending=False)

Unnamed: 0,Country,2020Forecast
228,United States,27065
96,India,13190.3
78,Germany,7212.33
226,United Kingdom,6793
40,Canada,4169
...,...,...
153,Niger,0
67,Eritrea,-1
182,Saint Lucia,-1
155,North Korea,-1.33333


## Summary
One can conclude that only three Python libraries are needed to make a simple prediction from historical data. The more data available, the more accurate the model can potentially be. Scoring is always recommended to assess the accuracy of a model, but in this case there was not enough data to create effective training and test sets. Overall, this is a simple and clean implementation of linear regression with scikit-learn.
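
With only three survey years a conventional train/test split is not practical, but a rough sanity check is still possible: fit on the first two years and compare the extrapolated 2019 value with the observed one. The sketch below does this for the Bachelor's degree counts shown earlier. It is an illustrative backtest that the notebook itself did not run, and `line_through` is a hypothetical helper.

```python
def line_through(p1, p2):
    """Return a function giving the straight line through two (x, y) points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    return lambda x: y1 + slope * (x - x1)

# Bachelor's degree counts: 2017 and 2018 are used to fit, 2019 is held out.
predict = line_through((2017, 21609), (2018, 43659))
predicted_2019 = predict(2019)
observed_2019 = 39134
error = predicted_2019 - observed_2019

print(predicted_2019, observed_2019, error)
```

Here the two-year line overshoots the observed 2019 count by 26,575 respondents, which gives a sense of how uncertain the 2020 forecasts are.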