The objective of this notebook is to answer the following four questions:

- What are the most used programming languages in 2020, and which programming languages do developers most want to work with in 2021?
- What are the most popular database environments used in 2020, and which databases are developers most interested in learning for 2021?
- What are the most popular web frameworks for 2020, and which web frameworks are developers most interested in learning for next year?
- What are the most popular libraries/frameworks for developers in 2020, and which libraries and frameworks are developers most interested in learning for next year?

To answer the above questions, we explore the 2020 Stack Overflow Annual Developer Survey. The survey has nearly 65,000 responses (64,461 rows in the public dataset) from across 180 countries.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import defaultdict

In [2]:
import altair as alt

In [3]:
df = pd.read_csv('./developer_survey_2020/survey_results_public.csv')
schema_df = pd.read_csv('./developer_survey_2020/survey_results_schema.csv',encoding= 'unicode_escape')

In [4]:
df.head()

Unnamed: 0,Respondent,MainBranch,Hobbyist,Age,Age1stCode,CompFreq,CompTotal,ConvertedComp,Country,CurrencyDesc,...,SurveyEase,SurveyLength,Trans,UndergradMajor,WebframeDesireNextYear,WebframeWorkedWith,WelcomeChange,WorkWeekHrs,YearsCode,YearsCodePro
0,1,I am a developer by profession,Yes,,13,Monthly,,,Germany,European Euro,...,Neither easy nor difficult,Appropriate in length,No,"Computer science, computer engineering, or sof...",ASP.NET Core,ASP.NET;ASP.NET Core,Just as welcome now as I felt last year,50.0,36,27.0
1,2,I am a developer by profession,No,,19,,,,United Kingdom,Pound sterling,...,,,,"Computer science, computer engineering, or sof...",,,Somewhat more welcome now than last year,,7,4.0
2,3,I code primarily as a hobby,Yes,,15,,,,Russian Federation,,...,Neither easy nor difficult,Appropriate in length,,,,,Somewhat more welcome now than last year,,4,
3,4,I am a developer by profession,Yes,25.0,18,,,,Albania,Albanian lek,...,,,No,"Computer science, computer engineering, or sof...",,,Somewhat less welcome now than last year,40.0,7,4.0
4,5,"I used to be a developer by profession, but no...",Yes,31.0,16,,,,United States,,...,Easy,Too short,No,"Computer science, computer engineering, or sof...",Django;Ruby on Rails,Ruby on Rails,Just as welcome now as I felt last year,,15,8.0


In [5]:
schema_df.head()

Unnamed: 0,Column,QuestionText
0,Respondent,Randomized respondent ID number (not in order ...
1,MainBranch,Which of the following options best describes ...
2,Hobbyist,Do you code as a hobby?
3,Age,What is your age (in years)? If you prefer not...
4,Age1stCode,At what age did you write your first line of c...


# What are the top programming languages used by developers in 2020?

In [6]:
df['LanguageWorkedWith']

0                                   C#;HTML/CSS;JavaScript
1                                         JavaScript;Swift
2                                 Objective-C;Python;Swift
3                                                      NaN
4                                        HTML/CSS;Ruby;SQL
                               ...                        
64456                                                  NaN
64457    Assembly;Bash/Shell/PowerShell;C;C#;C++;Dart;G...
64458                                                  NaN
64459                                             HTML/CSS
64460                      C#;HTML/CSS;Java;JavaScript;SQL
Name: LanguageWorkedWith, Length: 64461, dtype: object

In [7]:
languages = ['Assembly', 'Bash/Shell/PowerShell', 'C','C#','C++','Dart' ,'Go', 'Haskell',
             'HTML/CSS', 'Java', 'JavaScript', 'Julia','Kotlin', 'Objective-C', 'Perl',
             'PHP', 'Python', 'R']
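
The list above is typed by hand, so an option can be missed or misspelled. As a hedged alternative, the options could be derived from the answers themselves. The sketch below assumes each answer is a `;`-separated string, as in the `LanguageWorkedWith` output shown earlier; `unique_options` is a hypothetical helper, not part of the notebook.

```python
def unique_options(responses):
    """Return the sorted distinct ';'-separated items across all responses."""
    options = set()
    for response in responses:
        # Missing answers (None / NaN) are not strings and are skipped
        if isinstance(response, str):
            options.update(item.strip() for item in response.split(';') if item.strip())
    return sorted(options)

# Sample answers copied from the column preview above
sample = ['C#;HTML/CSS;JavaScript', 'JavaScript;Swift', None, 'HTML/CSS;Ruby;SQL']
print(unique_options(sample))
```

On the real data, the same function could be called as `unique_options(df['LanguageWorkedWith'])`.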

In [93]:
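def clean_counts(df, col, values, col1, col2 = 'counts'):
    """Count how often each entry of `values` appears in the ';'-separated column `col`.

    Returns a DataFrame with columns [col1, col2].
    """
    # One row per distinct answer string, with the number of respondents who gave it
    raw_df = df[col].value_counts().reset_index()
    clean_df = defaultdict(int)

    for val in values:
        for index, row in raw_df.iterrows():
            # Substring test: 'C' also matches 'C#', 'C++', 'Objective-C' and 'HTML/CSS',
            # and 'Java' also matches 'JavaScript'
            if val in list(row)[0]:
                clean_df[val] += int(list(row)[1])

    clean_df = pd.DataFrame(pd.Series(clean_df)).reset_index()
    clean_df.columns = [col1, col2]
    clean_df = clean_df.reset_index(drop=True)

    return clean_df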
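
Because `clean_counts` tests `val in ...` against the whole answer string, short names can match inside longer ones. Below is a minimal sketch of an exact-match alternative, assuming answers are `;`-separated as in the columns shown earlier. `count_exact` is a hypothetical helper and not part of the notebook.

```python
from collections import Counter

def count_exact(responses, values):
    """Count how many responses list each value as a whole ';'-separated item."""
    wanted = set(values)
    counts = Counter()
    for response in responses:
        if not isinstance(response, str):  # skip missing answers (None / NaN)
            continue
        # set() so that an item repeated within one response is counted once
        for item in set(response.split(';')):
            if item in wanted:
                counts[item] += 1
    return {value: counts[value] for value in values}

sample = ['C#;HTML/CSS;JavaScript', 'JavaScript;Swift', float('nan'), 'C;Java']
result = count_exact(sample, ['C', 'Java', 'JavaScript'])
print(result)
```

On this sample a substring test would report 2 for C and 3 for Java, while exact matching reports 1 for each. With pandas, the same idea can be written as `df[col].str.split(';').explode().value_counts()`.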

In [117]:
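# Plotting function
# input: df1 (worked-with counts), df2 (desired-next-year counts)
# returns: an Altair bar chart of counts per language, coloured by usage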
def create_plot(df1, df2):
    past_df = df1.copy()
    next_df = df2.copy()
    
    past_df['usage'] = 'Worked with in PAST year'
    next_df['usage'] = 'Want to work with NEXT year'
    
    concat_df = pd.concat([past_df, next_df])
    

    plot = alt.Chart(concat_df).mark_bar(opacity=0.9).encode(
        x='counts',
        y= alt.Y('language', sort='-x'),
        color='usage'
    ).properties(width=650)
    return plot
    

In [94]:
# Worked With in 2020
LanguageWorkedWith = clean_counts(df,'LanguageWorkedWith', languages, 'language')

In [95]:
LanguageDesireNextYear = clean_counts(df,'LanguageDesireNextYear', languages, 'language')

In [118]:
create_plot(LanguageWorkedWith, LanguageDesireNextYear)

In [76]:
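# Normalized stacked bars: share of "worked with" vs. "want to work with" per language.
# `source` is assumed to be the two count tables stacked with a 'usage' label.
source = pd.concat([
    LanguageWorkedWith.assign(usage='Worked with in PAST year'),
    LanguageDesireNextYear.assign(usage='Want to work with NEXT year'),
])
bars = alt.Chart(source).mark_bar().encode(
    x= alt.X('counts', stack="normalize"),
    y= alt.Y('language', sort='-x'),
    color='usage'
)
bars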

In [99]:
## Differential plot: desired next year minus worked with in the past year
copy = LanguageWorkedWith.copy()
copy['counts'] = LanguageDesireNextYear['counts'] - LanguageWorkedWith['counts']

In [100]:
copy

Unnamed: 0,language,counts
0,Assembly,-1084
1,Bash/Shell/PowerShell,-7252
2,C,-13124
3,C#,-4367
4,C++,-3951
5,Dart,2462
6,Go,7567
7,Haskell,1774
8,HTML/CSS,-15410
9,Java,-14078


In [119]:
alt.Chart(copy).mark_bar().encode(
    x="language",
    y="counts",
    color=alt.condition(
        alt.datum.counts > 0,  # test the plotted differential column
        alt.value("steelblue"),  # The positive color
        alt.value("orange")  # The negative color
    )
).properties(width=650)

What does this graph tell us?

- For each language, it shows the number of developers who want to work with it next year minus the number who worked with it in the past year. A positive bar means more interest than current usage; a negative bar means the opposite.

In [12]:
# R:  16649 (desired next year) - 9293 (worked with) = 7356
16649 - 9293
# Go: 12605 (desired next year) - 5038 (worked with) = 7567
12605 - 5038

7567
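
The arithmetic above can be kept in one place rather than as separate expressions. This is a small sketch that reuses only the Go and R counts from the cell above; the dictionary names are illustrative and do not appear in the notebook.

```python
# Counts for Go and R, taken from the arithmetic cell above
worked_with = {'Go': 5038, 'R': 9293}
desired_next_year = {'Go': 12605, 'R': 16649}

# Differential = desired next year minus worked with in the past year
differential = {lang: desired_next_year[lang] - worked_with[lang] for lang in worked_with}

# Largest upward differential first
ranked = sorted(differential.items(), key=lambda kv: kv[1], reverse=True)
for lang, diff in ranked:
    print(f"{lang}: {diff:+d}")
```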

# Key Takeaways:
- By these counts, C is the most used language and also the language most Stack Overflow respondents want to work with next year. Note that `clean_counts` matches substrings, so the C total also includes C#, C++, Objective-C and HTML/CSS.
- Although Python is currently the 5th most used language, it ranks 3rd among the most desired languages for 2021.
- Go and R show large upward differentials between the number of developers currently using them and the number who want to use them next year (+7,567 and +7,356 respectively).

# What are the most popular database environments used in 2020?

In [13]:
databases = ['Cassandra', 'Couchbase', 'DynamoDB', 'Elasticsearch', 'Firebase',
             'IBM DB2', 'MariaDB', 'Microsoft SQL Server', 'MongoDB', 'MySQL', 
             'Oracle']

In [14]:
clean_counts(df,'DatabaseWorkedWith', databases, 'database')

Unnamed: 0,database,counts
0,MySQL,27559
1,Microsoft SQL Server,16336
2,MongoDB,13086
3,MariaDB,8312
4,Oracle,8155
5,Firebase,7128
6,Elasticsearch,6817
7,DynamoDB,3497
8,Cassandra,1654


# What are the most desired databases for next year?

In [15]:
clean_counts(df,'DatabaseDesireNextYear', databases, 'database')

Unnamed: 0,database,counts
0,MongoDB,16024
1,MySQL,15734
2,Elasticsearch,10269
3,Microsoft SQL Server,9876
4,Firebase,8600
5,MariaDB,6126
6,Oracle,4794
7,DynamoDB,4773
8,Cassandra,4227
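
Absolute counts favour databases that are already widely used, so a relative view can complement the two tables above. The sketch below copies counts from those tables and computes the change from "worked with" to "desired next year" as a fraction of current usage. `relative_change` is a hypothetical helper, and the choice of metric is an assumption rather than something the notebook defines.

```python
# Counts copied from the two tables above
worked = {'MySQL': 27559, 'MongoDB': 13086, 'MariaDB': 8312, 'Oracle': 8155,
          'Firebase': 7128, 'Elasticsearch': 6817, 'DynamoDB': 3497, 'Cassandra': 1654}
desired = {'MySQL': 15734, 'MongoDB': 16024, 'MariaDB': 6126, 'Oracle': 4794,
           'Firebase': 8600, 'Elasticsearch': 10269, 'DynamoDB': 4773, 'Cassandra': 4227}

def relative_change(worked, desired):
    """Return {name: (desired - worked) / worked} for names present in both tables."""
    return {name: (desired[name] - worked[name]) / worked[name]
            for name in worked if name in desired and worked[name] > 0}

changes = relative_change(worked, desired)
# Print the largest relative increase first
for name, change in sorted(changes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<15}{change:+.0%}")
```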


# What are the most used platforms?

In [16]:
df['PlatformDesireNextYear']

0           Android;iOS;Kubernetes;Microsoft Azure;Windows
1                               iOS;Kubernetes;Linux;MacOS
2                                                      NaN
3                                                      NaN
4        Docker;Google Cloud Platform;Heroku;Linux;Windows
                               ...                        
64456                                                  NaN
64457                                                  NaN
64458                                                  NaN
64459                                                  NaN
64460                   Arduino;Linux;Raspberry Pi;Windows
Name: PlatformDesireNextYear, Length: 64461, dtype: object

In [17]:
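# Note: 'IBM Cloud or Watson iOS' and 'Slack Apps and Integrations Windows' each fuse two
# survey options into one string, so neither they nor 'iOS' / 'Windows' are matched below.
platforms = ['Android','Arduino', 'AWS', 'Docker', 'Google Cloud Platform', 'Heroku', 
             'IBM Cloud or Watson iOS', 'Kubernetes', 'Linux', 'MacOS', 'Microsoft Azure', 
             'Raspberry Pi', 'Slack Apps and Integrations Windows'] 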

In [18]:
clean_counts(df,'PlatformWorkedWith', platforms, 'platforms')

Unnamed: 0,platforms,counts
0,Linux,29600
1,Docker,18851
2,AWS,14389
3,Android,14101
4,MacOS,12898
5,Raspberry Pi,8010
6,Microsoft Azure,7830
7,Google Cloud Platform,7569
8,Kubernetes,6178
9,Heroku,5974


In [19]:
clean_counts(df,'PlatformDesireNextYear', platforms, 'platforms')

Unnamed: 0,platforms,counts
0,Linux,27475
1,Docker,23458
2,AWS,18381
3,Android,15085
4,Kubernetes,14009
5,MacOS,11793
6,Google Cloud Platform,11648
7,Raspberry Pi,11614
8,Microsoft Azure,9816
9,Arduino,6895


In [20]:
web_frameworks = ['Angular', 'Angular.js', 'ASP.NET', 'ASP.NET Core', 'Django', 'Drupal', 
                  'Express', 'Flask', 'Gatsby', 'jQuery', 'Laravel', 'React.js', 'Ruby on Rails', 
                  'Spring', 'Symfony', 'Vue.js']

In [21]:
clean_counts(df,'WebframeWorkedWith', web_frameworks, 'web_frameworks')

Unnamed: 0,web_frameworks,counts
0,jQuery,18316
1,React.js,15167
2,Angular,13481
3,ASP.NET,11572
4,Express,8961
5,ASP.NET Core,8082
6,Vue.js,7322
7,Spring,6941
8,Angular.js,6826
9,Django,6014


In [22]:
clean_counts(df,'WebframeDesireNextYear', web_frameworks, 'web_frameworks')

Unnamed: 0,web_frameworks,counts
0,React.js,20071
1,Vue.js,13142
2,Angular,13095
3,ASP.NET,10161
4,ASP.NET Core,9018
5,jQuery,8382
6,Django,8237
7,Express,8128
8,Spring,6241
9,Flask,6097


In [23]:
other_frameworks = ['.NET', '.NET Core', 'Ansible', 'Apache Spark', 'Chef',
                    'Cordova', 'Flutter', 'Hadoop', 'Keras', 'Node.js', 'Pandas',
                    'Puppet', 'React Native', 'TensorFlow', 'Terraform', 'Torch/PyTorch',
                    'Unity 3D', 'Unreal Engine']

In [24]:
MiscTechWorkedWith = clean_counts(df,'MiscTechWorkedWith', other_frameworks, 'other_frameworks')

In [25]:
MiscTechDesireNextYear = clean_counts(df,'MiscTechDesireNextYear', other_frameworks, 'other_frameworks')

In [30]:
import altair as alt
from vega_datasets import data
source = data.barley()

alt.Chart(source).mark_bar().encode(
    x='sum(yield)',
    y='variety',
    color='site',
    order=alt.Order(
      # Sort the segments of the bars by this field
      'site',
      sort='ascending'
    )
)

In [31]:
# x-axis: the number of respondents
# y-axis: the platform
# colour: desired for next year vs. worked with in the past
source

Unnamed: 0,yield,variety,year,site
0,27.00000,Manchuria,1931,University Farm
1,48.86667,Manchuria,1931,Waseca
2,27.43334,Manchuria,1931,Morris
3,39.93333,Manchuria,1931,Crookston
4,32.96667,Manchuria,1931,Grand Rapids
...,...,...,...,...
115,58.16667,Wisconsin No. 38,1932,Waseca
116,47.16667,Wisconsin No. 38,1932,Morris
117,35.90000,Wisconsin No. 38,1932,Crookston
118,20.66667,Wisconsin No. 38,1932,Grand Rapids
