# Python Data Analysis DOJO
In this part of the DOJO, we will use some of Python's popular data analysis capabilities and third-party libraries. These are commonly used within GE by our Data Scientists and Data Engineers.

AWS additionally provides an out-of-the-box Lambda Layer for including some of these capabilities in AWS Lambda functions efficiently (in terms of memory size).

In [1]:
# Import the required libraries and configure matplotlib to display plots inline in the notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy

%matplotlib inline

## Problem 1. Read in the StackOverflow Developer Survey data and metadata into DataFrames and review what's available

In [2]:
# Read in the survey data and its schema metadata (the cells below display the top rows)
data = pd.read_csv('./survey_results_public.csv')
metadata = pd.read_csv('./survey_results_schema.csv')

FileNotFoundError: [Errno 2] File b'./survey_results_public.csv' does not exist: b'./survey_results_public.csv'
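
The `FileNotFoundError` above indicates that the CSV files were not in the notebook's working directory when the cell ran. A minimal sketch of a pre-flight check, assuming the same two relative file names used in the cell; the helper name `missing_files` is hypothetical and not part of the original notebook:

```python
from pathlib import Path

def missing_files(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not Path(p).is_file()]

# File names taken from the read_csv calls above
required = ['./survey_results_public.csv', './survey_results_schema.csv']
missing = missing_files(required)
if missing:
    # Printing the working directory helps spot a wrong launch location
    print(f"Missing files (cwd={Path.cwd()}): {missing}")
else:
    print("All survey files found.")
```

Running a check like this before `pd.read_csv` makes it easier to tell a missing download apart from a notebook started in the wrong folder.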

In [None]:
data.head()

In [None]:
metadata.head()

In [None]:
data.drop(['Respondent'], axis=1).describe(include='all')

## Problem 2. For each Social Media platform, what's the average age of respondents who like it best?

In [None]:
data.groupby('SocialMedia')['Age'].mean().sort_values(ascending=False)
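
A mean on its own hides how many respondents stand behind each average. The sketch below, on made-up data (the platform names and ages are illustrative, not survey results), shows one way to report the mean together with the number of non-missing ages per group:

```python
import pandas as pd

# Made-up rows standing in for the survey data
toy = pd.DataFrame({
    'SocialMedia': ['Reddit', 'Reddit', 'Twitter', 'Twitter', 'YouTube', None],
    'Age': [20.0, 30.0, 40.0, None, 22.0, 50.0],
})

# groupby drops rows with a missing key; mean and count both skip missing ages
summary = (
    toy.groupby('SocialMedia')['Age']
       .agg(['mean', 'count'])
       .sort_values('mean', ascending=False)
)
print(summary)
```

A group with a very small `count` probably deserves less weight when comparing averages.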

## Problem 3. For each 10-year Age group (1-10 year olds, 11-20 year olds, etc.), what's the average level of compensation?

In [None]:
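# pd.cut uses right-closed intervals by default, so start at 0 to make the first bin (0, 10] cover ages 1-10
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]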
labels = ['1-10', '11-20', '21-30', '31-40', '41-50', '51-60', '61-70', '71-80', '81-90', '91-100']

# Add a new column to the DataFrame holding the Age Groups
data['AgeGroup'] = pd.cut(data['Age'], bins, labels=labels)
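
By default `pd.cut` builds right-closed intervals, so a value equal to the lowest bin edge is left unbinned. A small sketch on made-up ages (the values are illustrative only) shows where boundary ages land:

```python
import pandas as pd

ages = pd.Series([1, 10, 11, 20, 21])
bins = [0, 10, 20, 30]
labels = ['1-10', '11-20', '21-30']

# Intervals are (0, 10], (10, 20], (20, 30]
groups = pd.cut(ages, bins, labels=labels)
print(groups.tolist())

# With a first edge of 1, the interval is (1, 10] and age 1 gets no group
unbinned = pd.cut(pd.Series([1]), [1, 10], labels=['1-10'])
print(unbinned.isna().tolist())
```

Passing `include_lowest=True` is another way to keep the lowest edge inside the first bin.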

In [None]:
comp_by_age = data.groupby('AgeGroup')['ConvertedComp'].mean()
print(comp_by_age)

## Problem 4. Write a plotting function to visualize the above results

In [None]:
def plotGroupedData(series, group_col, val_col, title, y_label):
    # Draw a bar chart of a Series produced by a groupby aggregation.
    # group_col labels the x-axis; val_col is accepted only so existing calls keep working.
    positions = range(len(series))
    plt.bar(positions, series.values, align='center')
    # Use the Series' own index for tick labels so bars and labels always line up
    plt.xticks(positions, series.index.astype(str))
    plt.xlabel(group_col)
    plt.ylabel(y_label)
    plt.title(title)

In [None]:
plotGroupedData(comp_by_age, 'AgeGroup', 'ConvertedComp', 'Average Compensation by Age Group', 'Salary / Yr. (USD)')

## Problem 5. Using your Age Groups from earlier, what is the most common Work Challenge faced per Age Group?
Notice any issues with the data format below? How can we split this data up to get a true count of Work Challenges?

In [None]:
data['WorkChallenge'].value_counts().nlargest(10)

In [None]:
# Many columns in the dataset share this semicolon-delimited format. Can we implement the split so it can be reused
# across many different columns (e.g. WorkChallenge, JobFactors, DevEnviron)?
def most_popular_value_by_group(key_col_name, target_col_name):
    out_keys = []
    out_targets = []

    # Emit one (key, value) pair per semicolon-separated entry, skipping missing answers
    for key, target in zip(data[key_col_name], data[target_col_name]):
        if pd.isnull(target):
            continue
        values = target.split(';')
        out_keys.extend([key] * len(values))
        out_targets.extend(values)

    split_targets_by_key = pd.DataFrame({key_col_name: out_keys, target_col_name: out_targets})
    # mode() can return several tied values; [0] takes the first of them
    return split_targets_by_key.groupby(key_col_name)[target_col_name].apply(lambda x: x.mode()[0])

In [None]:
df = most_popular_value_by_group('AgeGroup', 'WorkChallenge')
print(df)
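
As an alternative to the explicit loop in `most_popular_value_by_group`, the same split can be sketched with `Series.str.split` and `DataFrame.explode`. This assumes a pandas version that provides `explode` (0.25 or later) and uses made-up rows rather than the survey data:

```python
import pandas as pd

# Made-up rows; the age groups and challenge texts are illustrative only
toy = pd.DataFrame({
    'AgeGroup': ['21-30', '21-30', '31-40', '31-40', '31-40'],
    'WorkChallenge': [
        'Meetings;Distracting work environment',
        'Meetings',
        'Non-work commitments',
        'Non-work commitments;Meetings',
        None,
    ],
})

# Drop missing answers, split on ';', then emit one row per individual value
exploded = (
    toy.dropna(subset=['WorkChallenge'])
       .assign(WorkChallenge=lambda d: d['WorkChallenge'].str.split(';'))
       .explode('WorkChallenge')
)

# Most frequent value per group; [0] takes the first value if several are tied
most_common = exploded.groupby('AgeGroup')['WorkChallenge'].apply(lambda s: s.mode()[0])
print(most_common)
```

The toy keys here are plain strings. With a categorical key such as the `AgeGroup` column built by `pd.cut`, empty categories may need to be excluded (for example with `observed=True`) before taking the mode.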

# Helper Commands Below

In [None]:
with pd.option_context('display.max_rows', 10000): 
    display(metadata['Column'])

In [None]:
text = metadata[metadata['Column'] == 'MainBranch']['QuestionText']
print(list(text))
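
The lookup above can be wrapped in a small reusable helper. This is a sketch only: the function name `question_text` is hypothetical, and the two-row schema is a made-up stand-in with placeholder text, using the same `Column` and `QuestionText` column names as the cell above:

```python
import pandas as pd

def question_text(schema, column_name):
    """Return the question text for one survey column, or None if it is absent."""
    match = schema.loc[schema['Column'] == column_name, 'QuestionText']
    return match.iloc[0] if not match.empty else None

# Made-up schema standing in for survey_results_schema.csv
toy_schema = pd.DataFrame({
    'Column': ['Respondent', 'Age'],
    'QuestionText': ['Respondent ID number (placeholder text)',
                     'What is your age? (placeholder text)'],
})

print(question_text(toy_schema, 'Age'))
print(question_text(toy_schema, 'NotAColumn'))
```

In the notebook, the call would look like `question_text(metadata, 'MainBranch')`.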

In [None]:
data[['SocialMedia', 'MgrIdiot']].describe(include='all')