<h3>Data Science: Analysis Tutorial (50 min, from Corey Schafer)</h3>



#### Python Data Science Tutorial: Analyzing the 2019 Stack Overflow Developer Survey

##### In this Python Programming video, we will be learning how to download and analyze real-world data from the 2019 Stack Overflow Developer Survey. This is terrific practice for anyone getting into the data science field. We will learn different ways to analyze this data and also some best practices. Let's get started...

In [82]:
import csv

with open('data/survey_results_public.csv') as f:
    csv_reader = csv.DictReader(f)
    
    for line in csv_reader:
        print(line)
        break

{'Respondent': '1', 'MainBranch': 'I am a student who is learning to code', 'Hobbyist': 'Yes', 'OpenSourcer': 'Never', 'OpenSource': 'The quality of OSS and closed source software is about the same', 'Employment': 'Not employed, and not looking for work', 'Country': 'United Kingdom', 'Student': 'No', 'EdLevel': 'Primary/elementary school', 'UndergradMajor': 'NA', 'EduOther': 'Taught yourself a new language, framework, or tool without taking a formal course', 'OrgSize': 'NA', 'DevType': 'NA', 'YearsCode': '4', 'Age1stCode': '10', 'YearsCodePro': 'NA', 'CareerSat': 'NA', 'JobSat': 'NA', 'MgrIdiot': 'NA', 'MgrMoney': 'NA', 'MgrWant': 'NA', 'JobSeek': 'NA', 'LastHireDate': 'NA', 'LastInt': 'NA', 'FizzBuzz': 'NA', 'JobFactors': 'NA', 'ResumeUpdate': 'NA', 'CurrencySymbol': 'NA', 'CurrencyDesc': 'NA', 'CompTotal': 'NA', 'CompFreq': 'NA', 'ConvertedComp': 'NA', 'WorkWeekHrs': 'NA', 'WorkPlan': 'NA', 'WorkChallenge': 'NA', 'WorkRemote': 'NA', 'WorkLoc': 'NA', 'ImpSyn': 'NA', 'CodeRev': 'NA', '

In [83]:
# Look at the first question, "Hobbyist":

with open('data/survey_results_public.csv') as f:
    csv_reader = csv.DictReader(f)
    
    for line in csv_reader:
        print(line['Hobbyist'])
        break

Yes
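
To try the same access pattern without downloading the survey file, here is a minimal sketch that feeds `csv.DictReader` an in-memory sample. The two rows in `sample_csv` are hypothetical and only mimic the survey's column layout; `next(reader)` is used in place of the loop-and-break idiom.

```python
import csv
import io

# Hypothetical two-row sample that mimics the survey layout (not real data).
sample_csv = io.StringIO(
    "Respondent,Hobbyist,LanguageWorkedWith\n"
    "1,Yes,HTML/CSS;Java;JavaScript;Python\n"
    "2,No,C++;Python\n"
)

reader = csv.DictReader(sample_csv)

# next() pulls a single row, like looping once and breaking.
first_row = next(reader)

print(reader.fieldnames)      # column names taken from the header line
print(first_row["Hobbyist"])  # each row is a dict keyed by column name
```

Every value comes back as a string, which is why the real output above shows `'Respondent': '1'` rather than an integer.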


In [84]:
# First method: plain counter variables
# Count the Yes and No answers:

with open('data/survey_results_public.csv') as f:
    csv_reader = csv.DictReader(f)
    
    yes_count = 0
    no_count = 0 
    
    for line in csv_reader:
        if line['Hobbyist'] == 'Yes':
            yes_count += 1
        elif line['Hobbyist'] == 'No':
            no_count += 1

print(yes_count)
print(no_count)

71257
17626


In [85]:
# Compute the share of Yes and No answers (as fractions of the total):

total = yes_count + no_count

print(yes_count/total)
print(no_count/total)


0.8016943622514991
0.19830563774850085


In [86]:
yes_pct = (yes_count/total) * 100
yes_pct = round(yes_pct, 2)  # keep two decimal places

no_pct = (no_count/total) * 100
no_pct = round(no_pct, 2)

print(f'Yes: {yes_pct}%')
print(f'No: {no_pct}%')

Yes: 80.17%
No: 19.83%
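
The multiply-then-round step is repeated for every answer, so it can be folded into a small helper. This is only a sketch: the function name `percentage` and its zero-total guard are assumptions, not part of the tutorial. The counts are the ones printed above.

```python
def percentage(part, total, ndigits=2):
    """Return part/total as a percentage rounded to ndigits (0.0 if total is 0)."""
    if total == 0:
        return 0.0
    return round(part / total * 100, ndigits)


# Counts printed by the earlier cell.
yes_count, no_count = 71257, 17626
total = yes_count + no_count

print(f"Yes: {percentage(yes_count, total)}%")
print(f"No: {percentage(no_count, total)}%")
```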


In [87]:
# Second method: use a dict (tidies the code a little)
# Count the Yes and No answers and compute their percentages:

with open('data/survey_results_public.csv') as f:
    csv_reader = csv.DictReader(f)
    
    counts = {
        'Yes': 0,
        'No': 0
    }
    
    for line in csv_reader:
        counts[line['Hobbyist']] += 1

total = counts['Yes'] + counts['No']

yes_pct = (counts['Yes']/total) * 100
yes_pct = round(yes_pct, 2)

no_pct = (counts['No']/total) * 100
no_pct = round(no_pct, 2)

print(f'Yes: {yes_pct}%')
print(f'No: {no_pct}%')

Yes: 80.17%
No: 19.83%
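
The plain-dict version ran without error, so every row's `Hobbyist` value matched one of the two preset keys. A column with any other value would raise `KeyError`. The sketch below shows that failure mode on a hypothetical list of answers and collects the unexpected values instead of crashing.

```python
counts = {"Yes": 0, "No": 0}

# Hypothetical answers; "NA" is not one of the preset keys.
answers = ["Yes", "No", "Yes", "NA"]

unexpected = []
for answer in answers:
    try:
        counts[answer] += 1
    except KeyError:
        # A key that was never initialized cannot be incremented.
        unexpected.append(answer)

print(counts)
print(unexpected)
```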


In [88]:
# Third method: use a defaultdict (no need to initialize the values)
# Count the Yes and No answers and compute their percentages:

from collections import defaultdict

with open('data/survey_results_public.csv') as f:
    csv_reader = csv.DictReader(f)
    
    counts = defaultdict(int)
    
    for line in csv_reader:
        counts[line['Hobbyist']] += 1

total = counts['Yes'] + counts['No']

yes_pct = (counts['Yes']/total) * 100
yes_pct = round(yes_pct, 2)

no_pct = (counts['No']/total) * 100
no_pct = round(no_pct, 2)

print(f'Yes: {yes_pct}%')
print(f'No: {no_pct}%')

Yes: 80.17%
No: 19.83%
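
One side effect of `defaultdict` is worth knowing: merely reading a missing key inserts it with the default value. The sketch below illustrates this on hypothetical answers; the key names `"Maybe"` and `"Unknown"` are made up for the demonstration.

```python
from collections import defaultdict

counts = defaultdict(int)
for answer in ["Yes", "No", "Yes"]:  # hypothetical answers
    counts[answer] += 1

keys_before = len(counts)

# Indexing a missing key creates it with int() == 0.
_ = counts["Maybe"]
keys_after = len(counts)

# .get() returns a fallback without inserting anything.
fallback = counts.get("Unknown", 0)

print(keys_before, keys_after, dict(counts))
```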


In [89]:
# Fourth method: use a Counter (tidier code, no initialization, plus extra features)
# Count the Yes and No answers and compute their percentages:

from collections import Counter

with open('data/survey_results_public.csv') as f:
    csv_reader = csv.DictReader(f)
    
    counts = Counter()
    
    for line in csv_reader:
        counts[line['Hobbyist']] += 1

total = counts['Yes'] + counts['No']

yes_pct = (counts['Yes']/total) * 100
yes_pct = round(yes_pct, 2)

no_pct = (counts['No']/total) * 100
no_pct = round(no_pct, 2)

print(f'Yes: {yes_pct}%')
print(f'No: {no_pct}%')

Yes: 80.17%
No: 19.83%
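
Since `Counter` accepts any iterable, the explicit loop can be replaced by a generator expression. The following is a sketch on a hypothetical in-memory sample rather than the survey file. Note that `total` here sums every distinct answer, which matches `counts['Yes'] + counts['No']` only when no other values occur.

```python
import csv
import io
from collections import Counter

# Hypothetical single-column sample (not real survey data).
sample = io.StringIO("Hobbyist\nYes\nNo\nYes\nYes\n")

# Counter consumes the generator and tallies each value it yields.
counts = Counter(row["Hobbyist"] for row in csv.DictReader(sample))
total = sum(counts.values())

for answer, count in counts.most_common():
    print(f"{answer}: {count / total:.2%}")
```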


In [90]:
# Look at the question "LanguageWorkedWith":

with open('data/survey_results_public.csv') as f:
    csv_reader = csv.DictReader(f)
    
    language_counter = Counter()
    
    for line in csv_reader:
        print(line['LanguageWorkedWith'])
        break

HTML/CSS;Java;JavaScript;Python


In [91]:
with open('data/survey_results_public.csv') as f:
    csv_reader = csv.DictReader(f)
    
    language_counter = Counter()
    
    for line in csv_reader:
        languages = line['LanguageWorkedWith'].split(';')
        print(languages)
        break

['HTML/CSS', 'Java', 'JavaScript', 'Python']


In [92]:
# Count the languages

with open('data/survey_results_public.csv') as f:
    csv_reader = csv.DictReader(f)
    
    language_counter = Counter()
    
    for line in csv_reader:
        languages = line['LanguageWorkedWith'].split(';')
        
        for language in languages:
            language_counter[language] += 1
            
        print(language_counter)
        break

Counter({'HTML/CSS': 1, 'Java': 1, 'JavaScript': 1, 'Python': 1})


In [93]:
# Count the languages: use the update method

with open('data/survey_results_public.csv') as f:
    csv_reader = csv.DictReader(f)
    
    language_counter = Counter()
    
    for line in csv_reader:
        languages = line['LanguageWorkedWith'].split(';')

        language_counter.update(languages)
            
        print(language_counter)
        break

Counter({'HTML/CSS': 1, 'Java': 1, 'JavaScript': 1, 'Python': 1})
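
`Counter.update` iterates over whatever it is given, so it matters that `split(';')` returns a list. Passing a bare string would count individual characters. A small sketch with made-up values:

```python
from collections import Counter

language_counter = Counter()

# A list of names: each element is counted once.
language_counter.update("HTML/CSS;Python".split(";"))
language_counter.update(["Python"])

# A bare string is iterated character by character instead.
char_counter = Counter()
char_counter.update("Python")

print(language_counter)
print(char_counter)
```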


In [94]:
# Retrieve the five most commonly used languages

with open('data/survey_results_public.csv') as f:
    csv_reader = csv.DictReader(f)
    
    language_counter = Counter()
    
    for line in csv_reader:
        languages = line['LanguageWorkedWith'].split(';')

        language_counter.update(languages)
            
print(language_counter.most_common(5))

[('JavaScript', 59219), ('HTML/CSS', 55466), ('SQL', 47544), ('Python', 36443), ('Java', 35917)]


In [95]:
# Compute the percentage of respondents using each of the most common languages

with open('data/survey_results_public.csv') as f:
    csv_reader = csv.DictReader(f)
    total = 0
    
    language_counter = Counter()
    
    for line in csv_reader:
        languages = line['LanguageWorkedWith'].split(';')

        language_counter.update(languages)
        
        total += 1
            
for language, value in language_counter.most_common(5):
    language_pct = (value / total) * 100
    language_pct = round(language_pct, 2)
    
    print(f'{language}: {language_pct}%')

JavaScript: 66.63%
HTML/CSS: 62.4%
SQL: 53.49%
Python: 41.0%
Java: 40.41%
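
These percentages add up to more than 100% because each respondent can list several languages. Two optional refinements are sketched below on hypothetical rows: skipping rows whose answer is the `NA` placeholder (an assumption about the desired behavior, since the cell above counts every row in `total`), and using the `.2%` format spec so that trailing zeros are kept.

```python
from collections import Counter

# Hypothetical rows; "NA" stands for an unanswered question.
rows = [
    {"LanguageWorkedWith": "Python;SQL"},
    {"LanguageWorkedWith": "NA"},
    {"LanguageWorkedWith": "Python"},
]

language_counter = Counter()
answered = 0

for row in rows:
    value = row["LanguageWorkedWith"]
    if value == "NA":
        continue  # assumption: unanswered rows are left out of the denominator
    language_counter.update(value.split(";"))
    answered += 1

for language, count in language_counter.most_common(2):
    # .2% multiplies by 100 and always prints two decimals.
    print(f"{language}: {count / answered:.2%}")
```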


In [96]:
# Retrieve the developer types:

with open('data/survey_results_public.csv') as f:
    csv_reader = csv.DictReader(f)
    
    dev_type_info = {}
    
    for line in csv_reader:
        dev_types = line['DevType'].split(';')
        
        for dev_type in dev_types:
            dev_type_info[dev_type] = {}
            
for key in dev_type_info:
    print(key)

NA
Developer, desktop or enterprise applications
Developer, front-end
Designer
Developer, back-end
Developer, full-stack
Academic researcher
Developer, mobile
Data or business analyst
Data scientist or machine learning specialist
Database administrator
Engineer, data
Engineer, site reliability
Developer, QA or test
DevOps specialist
Developer, game or graphics
Educator
Student
Engineering manager
Senior executive/VP
System administrator
Developer, embedded applications or devices
Product manager
Scientist
Marketing or sales professional


In [97]:
# Retrieve the most common languages for each developer type:

with open('data/survey_results_public.csv') as f:
    csv_reader = csv.DictReader(f)
    
    dev_type_info = {}
    
    for line in csv_reader:
        dev_types = line['DevType'].split(';')
        
        for dev_type in dev_types:
            dev_type_info.setdefault(dev_type, {
                'total':0,
                'language_counter': Counter()
            })
            
            languages = line['LanguageWorkedWith'].split(';')
            dev_type_info[dev_type]['language_counter'].update(languages)
            dev_type_info[dev_type]['total'] += 1
            
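# Note: `total` here is still the overall respondent count from the previous
# cell, so each percentage is a share of all respondents. Use info['total']
# as the denominator to get the share within each developer type.
for dev_type, info in dev_type_info.items():
    print(dev_type)

    for language, value in info['language_counter'].most_common(5):
        language_pct = (value / total) * 100
        language_pct = round(language_pct, 2)
        
        print(f'{language}: {language_pct}%')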

NA
HTML/CSS: 4.66%
Python: 4.34%
JavaScript: 4.3%
Java: 3.63%
C++: 2.97%
Developer, desktop or enterprise applications
JavaScript: 13.22%
HTML/CSS: 12.58%
SQL: 12.38%
C#: 10.46%
Java: 8.71%
Developer, front-end
JavaScript: 26.3%
HTML/CSS: 25.07%
SQL: 17.58%
Java: 11.27%
PHP: 10.78%
Designer
HTML/CSS: 8.15%
JavaScript: 8.09%
SQL: 6.22%
PHP: 4.16%
Java: 4.07%
Developer, back-end
JavaScript: 33.05%
HTML/CSS: 29.93%
SQL: 29.29%
Java: 20.14%
Python: 18.61%
Developer, full-stack
JavaScript: 40.93%
HTML/CSS: 37.5%
SQL: 31.14%
Java: 19.35%
Bash/Shell/PowerShell: 18.01%
Academic researcher
Python: 4.07%
HTML/CSS: 3.73%
JavaScript: 3.62%
SQL: 3.17%
Java: 2.82%
Developer, mobile
JavaScript: 11.2%
HTML/CSS: 10.33%
Java: 9.46%
SQL: 8.48%
C#: 5.68%
Data or business analyst
SQL: 5.23%
HTML/CSS: 4.4%
JavaScript: 4.34%
Python: 3.67%
Bash/Shell/PowerShell: 2.72%
Data scientist or machine learning specialist
Python: 5.77%
SQL: 4.25%
JavaScript: 3.73%
HTML/CSS: 3.67%
Bash/Shell/PowerShell: 3.23%
Database 