<b>Project Name</b>: Learning the Code: Patterns in Learning and Career Outcomes

<b>Team Members</b>: Jie Luo, Kittnipatt Buranasiri, Lydia Schrandt

<b>Project Coach</b>: Alexander Levin-koopman

<b><font color='darkblue'>Introduction</font></b>

Welcome to this comprehensive analysis of the **2021 New Coder Survey**. This dataset encompasses responses from over **18,000 individuals** who shared their motivations, learning methods, and career aspirations related to coding.

<b><font color='darkblue'>Dataset Description</font></b>
The dataset contains 63 columns that capture a wide range of information about individuals learning to code, including:

- <b>Demographic Information</b>: Age, gender identity, geographic location, marital status, and highest level of education completed.
- <b>Learning Motivations and Methods</b>: Reasons for learning to code, methods used (e.g., online courses, books, bootcamps), helpful online resources, and in-person event attendance.
- <b>Career Goals and Employment Status</b>: Current employment in software development, interest in a software development career, preferred job roles, salary expectations, and willingness to relocate.
- <b>Financial Information</b>: Amount spent on learning to code, household debt levels, and current income.
- <b>Job Satisfaction and Employment Dynamics</b>: Satisfaction with various aspects of current employment (e.g., earnings, benefits, job security), job search experience, and job stability.

<b><font color='darkblue'>Objectives 🔍 </font></b>

In this data exploration & analysis, we will:

1. **Get Data General Info**: Obtain an overview of the dataset's structure, including the number of entries, columns, data types, and missing values.
2. **Process the Natural Language-Based Columns**: Transform and extract meaningful features from columns containing natural language data.
3. **Data Cleaning and Preprocessing**: Handle missing values, correct data types, and address any inconsistencies in the dataset.
4. **Exploratory Data Analysis (EDA)**: Visualize and summarize the data to identify patterns and trends.
5. **Feature Engineering**: Create new features that can enhance the performance of our predictive models.
6. **Handling Data Imbalance**: Apply resampling techniques to address class imbalance, ensuring that our models are trained on a balanced dataset.

<b><font color='darkblue'>Getting Started 🚀</font></b>

We begin our analysis of the 2021 New Coder Survey by downloading the raw dataset directly from the original freeCodeCamp GitHub repository, so that we work with the unaltered published version of the data. We then load it into our Google Colab environment with Python's pandas library.


In [None]:
### Importing Necessary Libraries
# Data Manipulation
import pandas as pd
import numpy as np

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Statistical Analysis
import scipy.stats as stats

# Display settings
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [None]:
# Download the CSV file using wget
!wget -O survey.csv 'https://raw.githubusercontent.com/freeCodeCamp/2021-new-coder-survey/main/2021%20New%20Coder%20Survey.csv'

--2024-09-27 23:34:12--  https://raw.githubusercontent.com/freeCodeCamp/2021-new-coder-survey/main/2021%20New%20Coder%20Survey.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17514831 (17M) [text/plain]
Saving to: ‘survey.csv’


2024-09-27 23:34:13 (109 MB/s) - ‘survey.csv’ saved [17514831/17514831]



In [None]:
# Load the CSV into a pandas DataFrame
df = pd.read_csv('survey.csv')

# Display the first five rows of the DataFrame
df.head(5)

Unnamed: 0,Timestamp,1. What is your biggest reason for learning to code?,2. What methods have you used to learn about coding? Please select all that apply.,3. Which online learning resources have you found helpful? Please select all that apply.,"4. If you have attended in-person coding-related events before, which ones have you found helpful? Please select all that apply.","5. If you have listened to coding-related podcasts before, which ones have you found helpful? Please select all that apply.","6. If you have watched coding-related YouTube videos before, which channels have you found helpful? Please select all that apply.",7. About how many hours do you spend learning each week?,8. About how many months have you been programming?,"9. Aside from university tuition, about how much money have you spent on learning to code so far (in US Dollars)?",...,45. Please tell us how satisfied you are with each of these following aspects of your present job [Job security],45. Please tell us how satisfied you are with each of these following aspects of your present job [Work-life balance],45. Please tell us how satisfied you are with each of these following aspects of your present job [Professional growth or leadership opportunities],45. Please tell us how satisfied you are with each of these following aspects of your present job [Workplace/company culture],45. Please tell us how satisfied you are with each of these following aspects of your present job [Diverse and inclusive work environment],45. Please tell us how satisfied you are with each of these following aspects of your present job [Weekly workload],46. About how many minutes does it take you to get to work each day?,47. Have you served in your country's military before?,48. Do you currently receive disability benefits from your government?,49. Do you have high speed internet at your home?
0,7/1/2021 10:10:23,To succeed in current career,"Online resources, Books, In-person bootcamps, ...","freeCodeCamp, Mozilla Developer Network (MDN),...","conferences, workshops, Meetup.com events",The Changelog,"CS Dojo, freeCodeCamp",4.0,120,,...,Somewhat satisfied,Somewhat dissatisfied,I do not know,Somewhat satisfied,Somewhat satisfied,Very dissatisfied,I work from home,No,No,Yes
1,7/1/2021 10:31:01,To change careers,"Online resources, Books, Online bootcamps","freeCodeCamp, Mozilla Developer Network (MDN),...",I haven't attended any in-person coding-relate...,"The Changelog, Code Newbie Podcast","Adrian Twarog, Code with Ania Kubów, Coder Cod...",10.0,6,30.0,...,Very dissatisfied,Somewhat satisfied,Somewhat dissatisfied,Somewhat dissatisfied,Somewhat satisfied,Somewhat satisfied,15 to 29 minutes,No,Yes,Yes
2,7/1/2021 10:42:31,To change careers,"Online resources, Books, Hackathons, Meetup.co...","freeCodeCamp, Mozilla Developer Network (MDN),...",Meetup.com events,I haven't listened to any podcasts,"AmigosCode, Dev Ed, freeCodeCamp, Kevin Powell...",30.0,48,300.0,...,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,I am not working,No,No,Yes
3,7/1/2021 11:06:43,As a hobby,"Online resources, Books","freeCodeCamp, Mozilla Developer Network (MDN),...",I haven't attended any in-person coding-relate...,"Darknet Diaries, Real Python Podcast","freeCodeCamp, Traversy Media",,36,0.0,...,,,,,,,I am not working,No,No,No
4,7/1/2021 11:14:31,To start your first career,"Online resources, Books, Online bootcamps","freeCodeCamp, Stack Overflow, Coursera, Udemy",I haven't attended any in-person coding-relate...,Talk Python to Me,"freeCodeCamp, The Net Ninja, Traversy Media",2.0,24,5000.0,...,Somewhat dissatisfied,Somewhat satisfied,Somewhat dissatisfied,Somewhat satisfied,Somewhat dissatisfied,Somewhat dissatisfied,45 to 60 minutes,No,No,Yes


<b>Part I. Get Data General Info</b>

Let's take a quick look at the dataset to get an overview of its structure.

In [5]:
# Display the shape of the DataFrame
#print(f"Dataset contains {df.shape[0]} rows and {df.shape[1]} columns.\n")

# Display information about data types and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18126 entries, 0 to 18125
Data columns (total 63 columns):
 #   Column                                                                                                                                                                      Non-Null Count  Dtype  
---  ------                                                                                                                                                                      --------------  -----  
 0   Timestamp                                                                                                                                                                   18126 non-null  object 
 1   1. What is your biggest reason for learning to code?                                                                                                                        17991 non-null  object 
 2   2. What methods have you used to learn about coding? Please select all that apply.

<font color='darkgreen'><b>Comments</b>:

- The df.info() output shows that many columns have long, descriptive names. While informative, these names hinder readability and slow down data manipulation and analysis, so we will rename them to shorter labels.
- Additionally, many columns have a noticeable share of missing values. Handling this missing data is important for maintaining the integrity and reliability of the analysis.</font>

<font color='darkgrey'><i>Note: Because the raw dataset contains a significant amount of natural language data, we need to transform many columns into new features or attributes. We will therefore defer handling missing values until the dataset has been converted into meaningful features.</i></font>
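
As a rough illustration of the two observations above, the sketch below renames a few long question columns and measures the share of missing values per column. It runs on a small toy DataFrame rather than the real survey, and the short names in `short_names` are hypothetical choices for this example, not the names used later in the analysis.

```python
import pandas as pd

# Toy stand-in for the survey DataFrame; the values are illustrative only.
toy = pd.DataFrame({
    "1. What is your biggest reason for learning to code?": [
        "To change careers", None, "As a hobby", "To change careers"],
    "7. About how many hours do you spend learning each week?": [
        4.0, 10.0, None, None],
    "8. About how many months have you been programming?": [
        120, 6, 48, 36],
})

# Hypothetical short names (an assumption for this sketch).
short_names = {
    "1. What is your biggest reason for learning to code?": "reason_to_code",
    "7. About how many hours do you spend learning each week?": "hours_per_week",
    "8. About how many months have you been programming?": "months_programming",
}
toy = toy.rename(columns=short_names)

# Percentage of missing values per column, largest first.
missing_pct = toy.isna().mean().mul(100).round(1).sort_values(ascending=False)
print(missing_pct)
```

An explicit mapping is more verbose than deriving names automatically, but it avoids collisions between questions that share the same opening words, such as the repeated "45. Please tell us how satisfied you are..." columns.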

<b>Part II. Process the Natural Language-Based Columns</b>

In this section, we will transform the dataset's extensive natural language data into structured, numerical features suitable for analysis and modeling.
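
One possible way to do this for the "select all that apply" questions is to split each comma-separated answer into one 0/1 indicator column per option. The sketch below is a minimal example under stated assumptions: the toy `methods` Series, the `method_` prefix, and the `method_missing` flag are all hypothetical, and it assumes that no option label itself contains the `", "` separator.

```python
import pandas as pd

# Toy multi-select answers modeled on the "methods used to learn" question.
methods = pd.Series(
    ["Online resources, Books, Online bootcamps",
     "Online resources, Books",
     None],
    name="learning_methods",
)

# One indicator column per option; missing answers become all zeros.
indicators = (
    methods.fillna("")
    .str.get_dummies(sep=", ")
    .add_prefix("method_")
)

# Keep a separate flag so "no answer" is not confused with "nothing selected".
indicators["method_missing"] = methods.isna().astype(int)
print(indicators)
```

Keeping the missing-answer flag alongside the indicators fits the plan to postpone missing-value handling until after the features have been built.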

<b>Part III. Data Cleaning and Preprocessing</b>

Data cleaning and preprocessing are fundamental steps in any data analysis project. They ensure that the data is accurate, consistent, and suitable for analysis and modeling. In this section, we will: