# OMIS 114 Data Science with Python
## Assignment 4 - Data Aggregation

#### Due February 12 at 3:50PM :: 100 Points Total - 46 Points for Camino Quiz; 54 Points for Jupyter Notebook

**Description:**<br>The file 'cleaned survey.csv' contains the responses from a survey of graduate students enrolled in a data science course.<br>Perform an analysis of the survey responses.<br>Prepare this Jupyter notebook file to professionally present your analysis.

<ul>
<li>Expand the notebook (insert cells) as required.</li>
<li>To respond to the questions in this notebook, generate a response in the cell immediately following each question.</li>
<li>Complete the associated quiz on Camino.</li>
<li>The points associated with the assignment questions are earned by computing and providing the correct (expected) solution values on the Camino quiz.</li>
<li>Progress points are associated with an analysis task, and are assigned based upon an assessment of the progress made toward performing the analysis task completely and correctly and generating the correct solution values.</li>
<li>Performing an analysis task correctly and generating the correct solution values earns complete progress points.</li>
<li>Up to 8 progress points for notebook presentation, professionalism, and description of analysis steps (comments).</li>
</ul>

<ul>
<li>Include all steps of the analysis in the submitted notebook.</li>
<li>To earn points on a question, the notebook analysis must compute the value provided.</li>
<li>Include a comment describing each step of the analysis.</li>
<li>The analysis code should also function on any other similar survey data.</li>
<li>Additional Python packages (besides those imported) may not be used in the analysis.</li>
<li>Loops may not be used in the analysis.</li>
<li>Complete this assignment independently, without inappropriate collaboration or assistance.</li>
</ul>

**Directions:**<br>Begin by downloading the <b>cleaned survey.csv</b> file and storing it in the <b>same folder as this .ipynb file.</b>

Then, execute the following code to load the survey data into a dataframe <b>df</b>.<br>Use the dataframe <b>df</b> to answer the following questions.

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv("cleaned survey.csv", index_col=0)
df.drop(['Expert'], axis=1, inplace=True)

In [4]:
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)

In [5]:
df.head()

,Job,Program,ProgSkills,C,CPP,CS,Java,Python,JS,R,SQL,SAS,Excel,Tableau,Regression,Classification,Clustering,Bach_0to1,Bach_1to3,Bach_3to5,Bach_5Plus,Languages
0,0.0,MSIS,4,1,1,0.0,0,1.0,1.0,0.0,1.0,0.0,1,1,1.0,4,4,0,1,0,0,6
1,0.5,MSIS,3,1,1,0.0,1,0.0,0.0,0.0,1.0,0.0,1,0,0.0,2,2,0,0,0,1,4
2,0.0,MSIS,3,0,0,0.0,1,1.0,0.0,0.0,1.0,0.0,1,0,1.0,3,3,0,0,1,0,3
3,0.0,MSIS,3,1,0,0.0,1,1.0,0.0,1.0,1.0,0.0,1,0,1.0,2,3,0,0,0,1,5
4,0.0,MSIS,3,1,0,0.0,1,1.0,0.0,0.0,1.0,0.0,1,0,0.0,1,1,0,0,1,0,4


#### Data Set Description

The data set contains one row for each student who completed the survey. All survey respondents are considered students in the data science course.

<ul>
<li><b>Job</b>: 0 for students without a job, 0.5 for students with a part-time job, and 1 for students with a full-time job</li>
<li><b>Program</b>: the graduate program in which the student is enrolled</li>
<li><b>ProgSkills</b>: the student's level of computer programming knowledge (1-5)</li>
<li><b>C through Regression</b>: indicates whether the student knows (1) or doesn't know (0) a specific programming language or topic</li>
<li><b>Classification</b>: the student's level of knowledge (1-5) of classification methods</li>
<li><b>Clustering</b>: the student's level of knowledge (1-5) of clustering methods</li>
<li><b>Bach_0to1</b>: 1 if the time elapsed since their Bachelor's graduation is less than a year; 0 otherwise</li>
<li><b>Bach_1to3</b>: 1 if the time elapsed since their Bachelor's graduation is between one and three years; 0 otherwise</li>
<li><b>Bach_3to5</b>: 1 if the time elapsed since their Bachelor's graduation is between three and five years; 0 otherwise</li>
<li><b>Bach_5Plus</b>: 1 if the time elapsed since their Bachelor's graduation is more than five years; 0 otherwise</li>
<li><b>Languages</b>: the number of programming languages known</li>
</ul>
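
The four `Bach_` columns read like a one-hot encoding of a single "time since graduation" bracket. The sketch below shows one way that assumption could be checked without loops. The `toy` frame and its values are invented for illustration, and nothing in the description guarantees that the real survey has exactly one flag set per row.

```python
import pandas as pd

# Hypothetical toy rows using the survey's column names (values are invented)
toy = pd.DataFrame({
    "Bach_0to1":  [0, 0, 1],
    "Bach_1to3":  [1, 0, 0],
    "Bach_3to5":  [0, 0, 0],
    "Bach_5Plus": [0, 1, 0],
})
bach_cols = ["Bach_0to1", "Bach_1to3", "Bach_3to5", "Bach_5Plus"]

# Number of bracket flags set for each student
flags_per_row = toy[bach_cols].sum(axis=1)

# True only if every student falls in exactly one bracket
exactly_one = (flags_per_row == 1).all()

# Name of the bracket column that is set for each student
bracket = toy[bach_cols].idxmax(axis=1)
```

If `exactly_one` came out `False` on the real data, any filter that relies on a single `Bach_` column would need a second look.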

#### Question 1:<br><br>Aggregate students with the same level of computer programming knowledge (1-5; values stored in the 'ProgSkills' column).<br><br>For each of the 5 groups of students (1-5), compute these two statistics:<br><br>The percentage (0-100) of students within the group who know the Python and/or Java programming languages, with level of knowledge on classification methods that is greater than or equal to 2. These values are stored in the 'Python', 'Java', and 'Classification' columns, respectively. This statistic is referred to as 'PJClass%'.<br><br>The standard deviation (function std) of students' level of knowledge on clustering methods. These values are stored in the 'Clustering' column. This statistic is referred to as 'StdevClust'.<br><br>Output a data frame with one row for each level of computer programming knowledge (1-5), and the computed 'PJClass%' and 'StdevClust' statistics for that group of students.<br><br>12 points for result; plus up to 12 progress points

In [6]:
df.loc[df['Python'] == 1.0, 'PJClass'] = 1    # creates a new column 'PJClass' and sets it to 1 for students who know Python
df.loc[df['Java'] == 1.0, 'PJClass'] = 1      # sets 'PJClass' to 1 for students who know Java
df['PJClass'] = df['PJClass'].fillna(0)       # replaces the remaining NaN values with 0 (students who know neither Python nor Java)

new_df = df[df['Classification'] >= 2].groupby('ProgSkills')[['PJClass']].sum()  # for each ProgSkills level, counts the students who know Python/Java and have Classification >= 2
new_df['ProgSkillsCount'] = df.groupby('ProgSkills').size()                      # adds the total number of students at each ProgSkills level (the denominator)

new_df['PJClass%'] = (new_df['PJClass'] / new_df['ProgSkillsCount']) * 100       # percentage of each ProgSkills group that knows Python/Java with Classification >= 2
new_df['StdevClust'] = df.groupby('ProgSkills')['Clustering'].std()              # standard deviation of the Clustering knowledge level for each ProgSkills level

new_df = new_df.drop(columns=['PJClass','ProgSkillsCount'])                      # drops the helper columns, which were only needed for the calculation
new_df                                        # displays the results

ProgSkills,PJClass%,StdevClust
1,0.0,0.786796
2,36.363636,0.687552
3,65.517241,1.085053
4,58.333333,1.083625
5,50.0,0.707107
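
As a cross-check on the percentage logic, here is a minimal sketch of the same two statistics on toy data. It relies on the fact that the mean of a 0/1 flag is the fraction of ones. The `toy` frame is hypothetical and its values are invented; only the column names come from the survey.

```python
import pandas as pd

# Hypothetical toy data using the survey's column names (values are invented)
toy = pd.DataFrame({
    "ProgSkills":     [1, 1, 2, 2, 2, 3],
    "Python":         [0.0, 1.0, 1.0, 0.0, 1.0, 1.0],
    "Java":           [0, 0, 1, 0, 0, 1],
    "Classification": [1, 1, 2, 3, 1, 4],
    "Clustering":     [1, 3, 2, 2, 5, 4],
})

# 1 when the student knows Python and/or Java AND has Classification >= 2, else 0
meets = ((toy["Python"] == 1) | (toy["Java"] == 1)) & (toy["Classification"] >= 2)

grouped = toy.assign(PJClass=meets.astype(int)).groupby("ProgSkills")

# Mean of a 0/1 flag is the fraction of students meeting the criterion
result = pd.DataFrame({
    "PJClass%": grouped["PJClass"].mean() * 100,
    "StdevClust": grouped["Clustering"].std(),
})
```

Because rows are flagged rather than filtered before grouping, every `ProgSkills` level stays in `result`, including levels where no student meets the criterion. A level with a single student gets `NaN` for `StdevClust`, since the sample standard deviation is undefined for one observation.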


#### Question 2:<br><br>For each graduate program, how many students have a level of computer programming knowledge that is less than 4? These values are stored in the 'ProgSkills' column.<br><br>Output a series object with one row for each graduate program, and the number of students within the graduate program meeting the specified criterion.<br><br>10 points for result; plus up to 10 progress points

In [7]:
df[df['ProgSkills'] < 4].groupby('Program')['ProgSkills'].count()  #counts the students in each Program where ProgSkills < 4

Program
Business Man                      1
Faculty!                          1
MBA                              13
MSIS                             30
Supply Chain Mgmt & Analytics     2
Name: ProgSkills, dtype: int64
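
Filtering before grouping leaves out any program in which no student meets the criterion. If such programs should appear with a count of 0, one option is to sum a boolean flag per program instead. The sketch below compares the two approaches on hypothetical data; the program name `MSBA` and all values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data (program names and values are invented)
toy = pd.DataFrame({
    "Program":    ["MSIS", "MSIS", "MBA", "MBA", "MSBA"],
    "ProgSkills": [3, 5, 2, 1, 4],
})

# Filter first, then count: programs with no qualifying students are absent
filtered = toy[toy["ProgSkills"] < 4].groupby("Program")["ProgSkills"].count()

# Flag first, then sum: every program appears, with 0 where no one qualifies
flagged = (toy["ProgSkills"] < 4).groupby(toy["Program"]).sum()
```

Which form is wanted depends on how "one row for each graduate program" is read; both return a series indexed by `Program`.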

#### Question 3:<br><br>The 'machine learning level' of a student is defined as the largest value among their level of knowledge of classification methods and their level of knowledge of clustering methods. These values are stored in the 'Classification' and 'Clustering' columns, respectively.<br><br>Compute and output the average (function mean) 'machine learning level' among all students in the MSIS program (students with value of 'MSIS' stored in the 'Program' column).<br><br>12 points for result; plus up to 12 progress points

In [8]:
temp = df[df['Program'] == 'MSIS']    #creates temporary dataframe for only MSIS students
avg = (temp[['Classification', 'Clustering']].max(axis = 1)).mean()    #finds the max between Classification and Clustering of each student, then calculates the average
avg    #prints results

1.875
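
The row-wise maximum can be checked by hand on a small example. The sketch below uses a hypothetical three-row frame (values invented) to show how `max(axis=1)` picks the larger of the two knowledge levels for each student before the mean is taken.

```python
import pandas as pd

# Hypothetical toy data using the survey's column names (values are invented)
toy = pd.DataFrame({
    "Program":        ["MSIS", "MSIS", "MBA"],
    "Classification": [1, 4, 5],
    "Clustering":     [3, 2, 5],
})

# Largest of the two knowledge levels per student: 3, 4, 5
ml_level = toy[["Classification", "Clustering"]].max(axis=1)

# Average over the MSIS rows only: (3 + 4) / 2
msis_mean = ml_level[toy["Program"] == "MSIS"].mean()
```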

#### Question 4:<br><br>Among all students with at least one year elapsed since their Bachelor's program graduation, determine which student is the 'most knowledgeable'. All survey respondents are considered students in the data science course.<br><br>The 'most knowledgeable' student has the highest level of knowledge on classification methods (value stored in the 'Classification' column). If there are multiple students with the highest level of knowledge on classification methods, to break the tie and determine the 'most knowledgeable' student, consider next whether these students know the C programming language ('C' column). In case of a further tie, consider next whether these students know the CPP programming language ('CPP' column), then consider the CS programming language ('CS' column), then consider the Java programming language ('Java' column), then consider the SAS programming language ('SAS' column).<br><br>Once the 'most knowledgeable' student has been determined, output their graduate program (value stored in the 'Program' column).<br><br>12 points for result; plus up to 12 progress points

In [9]:
# Keep only students with at least one year elapsed since their Bachelor's graduation
eligible = df[(df['Bach_1to3'] == 1) | (df['Bach_3to5'] == 1) | (df['Bach_5Plus'] == 1)]
# Take the top student by Classification, breaking ties by C, CPP, CS, Java, then SAS, and output their Program
eligible.nlargest(1, ['Classification', 'C', 'CPP', 'CS', 'Java', 'SAS'])['Program']

16    Faculty!
Name: Program, dtype: object
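
The tie-breaking rule amounts to a multi-column descending sort. The sketch below shows that on hypothetical data, with a shortened tie-break list (`C`, `CPP`) and invented values. It also assumes each student has exactly one `Bach_` flag set, so that `Bach_0to1 == 0` means at least one year has elapsed; that assumption is not confirmed by the data set description.

```python
import pandas as pd

# Hypothetical toy data (values and index labels are invented)
toy = pd.DataFrame({
    "Program":        ["MBA", "MSIS", "MSIS", "MBA"],
    "Bach_0to1":      [1, 0, 0, 0],
    "Classification": [5, 4, 4, 3],
    "C":              [1, 0, 1, 1],
    "CPP":            [1, 1, 0, 1],
}, index=[10, 11, 12, 13])

# Assumption: exactly one Bach_ flag per student, so 0 here means one year or more
eligible = toy[toy["Bach_0to1"] == 0]

# Sort by Classification, then C, then CPP, all descending
ranked = eligible.sort_values(["Classification", "C", "CPP"], ascending=False)

# Program of the top-ranked student
top_program = ranked["Program"].iloc[0]

# nlargest with the same column order selects the same student
same_pick = eligible.nlargest(1, ["Classification", "C", "CPP"]).index[0]
```

Student 10 has the highest `Classification` but is excluded by the graduation filter. Students 11 and 12 tie at 4, and the `C` column breaks the tie in favor of student 12.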