# Data Understanding

## Dataset Overview
The Stack Overflow Developer Survey 2023 dataset contains 89,184 responses from developers around the world, spread across 84 columns. The survey covers demographics, programming experience, job satisfaction, technologies used, and compensation, among other topics.

### Files in the Dataset:
- `survey_results_public.csv`: Contains the main survey results.
- `survey_results_schema.csv`: Contains the survey schema, explaining the questions corresponding to each column.

The survey was conducted from May 8, 2023, to May 19, 2023.


In [9]:
import io
import os
import glob
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

pd.set_option("display.max_columns", 100)

In [10]:
directory_path = '../Datasets'

survey_results_path = f'{directory_path}/survey_results_public.csv'
schema_path = f'{directory_path}/survey_results_schema.csv'

survey_df = pd.read_csv(survey_results_path)
schema_df = pd.read_csv(schema_path)

print("Survey Results DataFrame:")
print(survey_df.head())
print("\nSchema DataFrame:")
print(schema_df.head())

# column names
print("\nColumn Names in Survey DataFrame:")
print(survey_df.columns)

Survey Results DataFrame:
   ResponseId     Q120                      MainBranch              Age  \
0           1  I agree                   None of these  18-24 years old   
1           2  I agree  I am a developer by profession  25-34 years old   
2           3  I agree  I am a developer by profession  45-54 years old   
3           4  I agree  I am a developer by profession  25-34 years old   
4           5  I agree  I am a developer by profession  25-34 years old   

                                          Employment  \
0                                                NaN   
1                                Employed, full-time   
2                                Employed, full-time   
3                                Employed, full-time   
4  Employed, full-time;Independent contractor, fr...   

                             RemoteWork  \
0                                   NaN   
1                                Remote   
2  Hybrid (some remote, some in-person)   
3  Hybrid (som

## Understanding the schema to decide which columns to prepare for EDA

In [11]:
print("Schema DataFrame:")
print(schema_df.head())

# Print every column name alongside the survey question it corresponds to
for qname, question in zip(schema_df['qname'], schema_df['question']):
    print(f"Column: {qname}")
    print(f"Question: {question}\n")


Schema DataFrame:
      qid     qname                                           question  \
0   QID16        S0  <div><span style="font-size:19px;"><strong>Hel...   
1   QID12  MetaInfo                                  Browser Meta Info   
2  QID310      Q310  <div><span style="font-size:19px;"><strong>You...   
3  QID312      Q120                                                      
4    QID1        S1  <span style="font-size:22px; font-family: aria...   

  force_resp  type selector  
0      False    DB       TB  
1      False  Meta  Browser  
2      False    DB       TB  
3       True    MC     SAVR  
4      False    DB       TB  
Column: S0
Question: <div><span style="font-size:19px;"><strong>Hello world! </strong></span></div>

<div> </div>

<div>Thank you for taking the 2023 Stack Overflow Developer Survey, the longest running survey of software developers (and anyone else who codes!) on Earth. </div>

<div> </div>

<div>There are seven sections in this survey. The 2nd, 3rd, 4th
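
The question text printed above is stored as raw HTML, which makes the schema hard to scan. The following is a minimal sketch of one way to clean it up, assuming that dropping the tags and collapsing whitespace is enough for reading purposes. The helper name `strip_html` and the sample string are illustrative and not part of the original notebook.

```python
import html
import re


def strip_html(text):
    """Return text with HTML tags removed and whitespace collapsed.

    Hypothetical helper: non-string inputs (e.g. NaN) become an empty string.
    """
    if not isinstance(text, str):
        return ""
    no_tags = re.sub(r"<[^>]+>", " ", text)   # replace each tag with a space
    unescaped = html.unescape(no_tags)        # decode entities such as &amp;
    return re.sub(r"\s+", " ", unescaped).strip()


# Sample modeled on the start of the S0 question shown in the output above
sample = '<div><span style="font-size:19px;"><strong>Hello world! </strong></span></div>'
print(strip_html(sample))
```

Applied to the notebook's data, this could look like `schema_df['question'].map(strip_html)`, which would leave the original column untouched and return a cleaned copy.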

## Understanding the basic statistics

In [12]:
print("Basic Statistics:")
print(survey_df.describe(include='all'))

Basic Statistics:
          ResponseId     Q120                      MainBranch  \
count   89184.000000    89184                           89184   
unique           NaN        1                               6   
top              NaN  I agree  I am a developer by profession   
freq             NaN    89184                           67237   
mean    44592.500000      NaN                             NaN   
std     25745.347541      NaN                             NaN   
min         1.000000      NaN                             NaN   
25%     22296.750000      NaN                             NaN   
50%     44592.500000      NaN                             NaN   
75%     66888.250000      NaN                             NaN   
max     89184.000000      NaN                             NaN   

                    Age           Employment  \
count             89184                87898   
unique                8                  106   
top     25-34 years old  Employed, full-time   
freq     
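
The `Employment` column reports 106 unique values, and the preview earlier shows entries with several answers joined by semicolons, so many of those values are likely combinations rather than distinct options. The sketch below shows one way to count the individual options instead. It assumes `;` is the separator throughout; the helper name `count_options` and the toy data are illustrative only and are not taken from the survey.

```python
import pandas as pd


def count_options(series, sep=";"):
    """Count how often each individual option appears in a delimited column."""
    return (
        series.dropna()        # ignore respondents who skipped the question
        .str.split(sep)        # one list of options per respondent
        .explode()             # one row per selected option
        .str.strip()
        .value_counts()
    )


# Toy data for illustration only (hypothetical option labels)
toy = pd.Series(["A", "A;B", None])
print(count_options(toy))
```

In the notebook this could be called as `count_options(survey_df['Employment'])`, which should give a much shorter list than the 106 combined values.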

## Understanding the Data Types and Missing Values

In [13]:

print("Data Types:")
print(survey_df.dtypes)

print("Missing Values:")
print(survey_df.isnull().sum())

Data Types:
ResponseId               int64
Q120                    object
MainBranch              object
Age                     object
Employment              object
                        ...   
ProfessionalTech        object
Industry                object
SurveyLength            object
SurveyEase              object
ConvertedCompYearly    float64
Length: 84, dtype: object
Missing Values:
ResponseId                 0
Q120                       0
MainBranch                 0
Age                        0
Employment              1286
                       ...  
ProfessionalTech       47401
Industry               52410
SurveyLength            2699
SurveyEase              2630
ConvertedCompYearly    41165
Length: 84, dtype: int64
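
Raw missing counts are easier to judge relative to the 89,184 rows: for example, the 41,165 missing `ConvertedCompYearly` values amount to roughly 46% of responses. A minimal sketch of a summary that reports both counts and percentages follows. The helper name `missing_summary` and the toy DataFrame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def missing_summary(df):
    """Return per-column missing counts and percentages, most incomplete first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)


# Toy data for illustration only
toy = pd.DataFrame({
    "a": [1, None, 3, None],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})
print(missing_summary(toy))
```

Running `missing_summary(survey_df).head(20)` in the notebook would surface the sparsest columns first, which can help when deciding which columns are usable for analysis.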


# Summary


## Columns chosen for analysis

The following columns fall within the scope of the project and are used in the analysis:

`Age`, `Employment`, `RemoteWork`, `CodingActivities`, `EdLevel`, `YearsCode`, `YearsCodePro`, `DevType`, `OrgSize`, `Country`, `ConvertedCompYearly`, `LearnCodeCoursesCert`, `LanguageHaveWorkedWith`, `DatabaseHaveWorkedWith`, `PlatformHaveWorkedWith`
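
The column list above can be turned into a working subset with a small helper. This is a sketch under the assumption that every listed name exists in the loaded DataFrame; the names `SELECTED_COLUMNS` and `select_columns` are hypothetical, and the toy DataFrame stands in for `survey_df`.

```python
import pandas as pd

# Columns chosen for analysis, as listed in this summary
SELECTED_COLUMNS = [
    "Age", "Employment", "RemoteWork", "CodingActivities", "EdLevel",
    "YearsCode", "YearsCodePro", "DevType", "OrgSize", "Country",
    "ConvertedCompYearly", "LearnCodeCoursesCert", "LanguageHaveWorkedWith",
    "DatabaseHaveWorkedWith", "PlatformHaveWorkedWith",
]


def select_columns(df, columns=SELECTED_COLUMNS):
    """Return a copy of df restricted to the chosen columns.

    Raises KeyError with the missing names, so a typo fails loudly.
    """
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise KeyError(f"Columns not found: {missing}")
    return df[list(columns)].copy()


# Toy stand-in for survey_df: one empty row with the chosen columns plus an extra one
toy = pd.DataFrame({c: [None] for c in SELECTED_COLUMNS + ["ResponseId"]})
subset = select_columns(toy)
print(subset.shape)
```

Taking a copy keeps later cleaning steps from modifying `survey_df` itself, at the cost of some extra memory.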
