The objective of this notebook is to answer the following four questions:

- What were the most used programming languages in 2020, and which programming languages do developers most want to work with in 2021?
- What were the most popular database environments in 2020, and which databases are developers most interested in learning for 2021?
- What were the most popular web frameworks in 2020, and which web frameworks are developers most interested in learning for next year?
- What were the most popular libraries/frameworks for developers in 2020, and which libraries and frameworks are developers most interested in learning for next year?

To answer the above questions, we explore the 2020 Stack Overflow Annual Developer Survey. The survey has over 65,000 responses from across 180 countries. 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import defaultdict
import altair as alt

In [2]:
df = pd.read_csv('./developer_survey_2020/survey_results_public.csv')
schema_df = pd.read_csv('./developer_survey_2020/survey_results_schema.csv', encoding='unicode_escape')

In [3]:
df.head()

Unnamed: 0,Respondent,MainBranch,Hobbyist,Age,Age1stCode,CompFreq,CompTotal,ConvertedComp,Country,CurrencyDesc,...,SurveyEase,SurveyLength,Trans,UndergradMajor,WebframeDesireNextYear,WebframeWorkedWith,WelcomeChange,WorkWeekHrs,YearsCode,YearsCodePro
0,1,I am a developer by profession,Yes,,13,Monthly,,,Germany,European Euro,...,Neither easy nor difficult,Appropriate in length,No,"Computer science, computer engineering, or sof...",ASP.NET Core,ASP.NET;ASP.NET Core,Just as welcome now as I felt last year,50.0,36,27.0
1,2,I am a developer by profession,No,,19,,,,United Kingdom,Pound sterling,...,,,,"Computer science, computer engineering, or sof...",,,Somewhat more welcome now than last year,,7,4.0
2,3,I code primarily as a hobby,Yes,,15,,,,Russian Federation,,...,Neither easy nor difficult,Appropriate in length,,,,,Somewhat more welcome now than last year,,4,
3,4,I am a developer by profession,Yes,25.0,18,,,,Albania,Albanian lek,...,,,No,"Computer science, computer engineering, or sof...",,,Somewhat less welcome now than last year,40.0,7,4.0
4,5,"I used to be a developer by profession, but no...",Yes,31.0,16,,,,United States,,...,Easy,Too short,No,"Computer science, computer engineering, or sof...",Django;Ruby on Rails,Ruby on Rails,Just as welcome now as I felt last year,,15,8.0


In [21]:
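def clean_counts(df, col, values, col1, col2='Number_of_Developers'):
    """Count the respondents whose answer in `col` mentions each name in `values`.

    Matching is by substring, so 'C' also matches 'C#', 'C++' and 'HTML/CSS',
    and 'Java' also matches 'JavaScript'.
    """
    # One entry per distinct answer string, with the number of respondents who gave it.
    raw_counts = df[col].value_counts()
    totals = defaultdict(int)
    for val in values:
        for answer, count in raw_counts.items():
            if val in answer:
                totals[val] += int(count)
    # Names that never match any answer are absent from the result.
    clean_df = pd.DataFrame(pd.Series(totals)).reset_index()
    clean_df.columns = [col1, col2]
    return clean_df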
    
    
def create_plot(df1, df2, y_axis):
    """Bar chart comparing past-year usage (df1) with next-year interest (df2)."""
    past_df = df1.copy()
    next_df = df2.copy()
    past_df['usage'] = 'Worked with in PAST year'
    next_df['usage'] = 'Want to work with NEXT year'
    concat_df = pd.concat([past_df, next_df])
    plot = alt.Chart(concat_df).mark_bar(opacity=0.9).encode(
        x=alt.X("Number_of_Developers", title="Number of Developers"),
        y=alt.Y(y_axis, sort='-x'),
        color='usage'
    ).properties(width=1000, height=600)
    return plot
   
def diff_plot(df1, df2, x_axis):
    """Bar chart of the df2 counts minus the df1 counts (rows aligned on the index)."""
    diff_df = df1.copy()
    diff_df['Number_of_Developers'] = \
        df2['Number_of_Developers'] - df1['Number_of_Developers']
    chart = alt.Chart(diff_df).mark_bar().encode(
        x=x_axis,
        y=alt.Y("Number_of_Developers", title="Number of Developers"),
        color=alt.condition(
            alt.datum.Number_of_Developers > 0,
            alt.value("steelblue"),  # The positive color
            alt.value("orange")  # The negative color
        )
    ).properties(width=600, height=600)
    return chart
    

# What are the top programming languages used by developers in 2020?

In [5]:
df['LanguageWorkedWith']

0                                   C#;HTML/CSS;JavaScript
1                                         JavaScript;Swift
2                                 Objective-C;Python;Swift
3                                                      NaN
4                                        HTML/CSS;Ruby;SQL
                               ...                        
64456                                                  NaN
64457    Assembly;Bash/Shell/PowerShell;C;C#;C++;Dart;G...
64458                                                  NaN
64459                                             HTML/CSS
64460                      C#;HTML/CSS;Java;JavaScript;SQL
Name: LanguageWorkedWith, Length: 64461, dtype: object
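
Each response above is a single semicolon-delimited string, so how a language name is matched against it matters. The following is a minimal sketch on made-up responses (the `responses` series is hypothetical, not survey data) contrasting a substring test with an exact-token count obtained by splitting on `;` first.

```python
import pandas as pd

# Hypothetical responses in the same semicolon-delimited format as the survey column.
responses = pd.Series([
    "C#;HTML/CSS;JavaScript",
    "JavaScript;Swift",
    None,
    "C;Java",
])

# Substring test: "Java" is also found inside "JavaScript".
substring_hits = responses.dropna().str.contains("Java", regex=False).sum()

# Exact test: split each response into individual answers, then count whole tokens.
tokens = responses.dropna().str.split(";").explode()
exact_counts = tokens.value_counts()

print(substring_hits)               # 3
print(exact_counts["Java"])         # 1
print(exact_counts["JavaScript"])   # 2
```

With exact tokens, short names such as C, Java or R are no longer credited with answers that merely contain them.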

In [6]:
languages = ['Assembly', 'Bash/Shell/PowerShell', 'C','C#','C++','Dart' ,'Go', 'Haskell',
             'HTML/CSS', 'Java', 'JavaScript', 'Julia','Kotlin', 'Objective-C', 'Perl',
             'PHP', 'Python', 'R', 'Rust', 'Scala', 'SQL', 'Swift', 'TypeScript', 'VBA']

In [7]:
LanguageWorkedWith = clean_counts(df,'LanguageWorkedWith', languages, 'language')
LanguageWorkedWith

Unnamed: 0,language,Number_of_Developers
0,Assembly,3553
1,Bash/Shell/PowerShell,18980
2,C,46769
3,C#,18041
4,C++,13707
5,Dart,2280
6,Go,5038
7,Haskell,1222
8,HTML/CSS,36181
9,Java,45749


In [8]:
LanguageDesireNextYear = clean_counts(df,'LanguageDesireNextYear', languages, 'language')
LanguageDesireNextYear

Unnamed: 0,language,Number_of_Developers
0,Assembly,2469
1,Bash/Shell/PowerShell,11728
2,C,33645
3,C#,13674
4,C++,9756
5,Dart,4742
6,Go,12605
7,Haskell,2996
8,HTML/CSS,20771
9,Java,31671


In [20]:
create_plot(LanguageWorkedWith, LanguageDesireNextYear, 'language')

In [23]:
diff_plot(LanguageWorkedWith, LanguageDesireNextYear, 'language')

###  Which languages have developers done extensive development work in over the past year, and which do developers want to work in over the next year?

**What does this graph tell us?**

- The second graph shows, for each language, the number of people who want to work with it NEXT year minus the number who worked with it in the PAST year.
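
As a worked example of that difference, the sketch below uses four rows copied from the two tables above. Joining on the label with `merge` is an assumption made here for illustration (the notebook's `diff_plot` subtracts the columns directly, relying on both tables sharing the same row order); the column names `difference`, `_worked` and `_desired` are likewise illustrative.

```python
import pandas as pd

# Counts copied from the first rows of the two tables above.
worked = pd.DataFrame({
    "language": ["Assembly", "Dart", "Go", "Haskell"],
    "Number_of_Developers": [3553, 2280, 5038, 1222],
})
desired = pd.DataFrame({
    # Deliberately in a different row order than `worked`.
    "language": ["Go", "Haskell", "Assembly", "Dart"],
    "Number_of_Developers": [12605, 2996, 2469, 4742],
})

# Join on the label so the subtraction does not depend on row order.
diff = worked.merge(desired, on="language", suffixes=("_worked", "_desired"))
diff["difference"] = (
    diff["Number_of_Developers_desired"] - diff["Number_of_Developers_worked"]
)
print(diff[["language", "difference"]])
```

A positive value (Go: 12605 - 5038 = 7567) means more developers want to work with the language than currently do; a negative value (Assembly: 2469 - 3553 = -1084) means the opposite.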

# Key Takeaways:
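- C and Java have the highest total counts. They are also the languages that the **most** developers on Stack Overflow want to work with next year (33645 and 31671, respectively). Note that `clean_counts` matches by substring, so the count for C also includes responses listing C#, C++, HTML/CSS or Objective-C, and the count for Java also includes JavaScript.
- However, **fewer** developers want to work with C NEXT year (33645) than worked with it in the PAST year (46769).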


- Python is currently the 5th most used language among developers on Stack Overflow, yet it ranks 3rd among the most desired languages for 2021.
- More people want to work with Python NEXT year than worked with it in the PAST year. Based on this data, it is likely that **more** developers will be working with Python in 2021 than in 2020.


- **Rust, Go and R have the largest positive differentials** between the number of developers who used them extensively in the PAST year and the number who want to work with them NEXT year, so we should expect Go and R to continue to grow in 2021!

### Which database environments have developers done extensive development work in over the past year, and which databases do they want to work in over the next year?

In [11]:
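# Note: 'Microsoft SQL Server' is entered as two separate strings ('Microsoft', 'SQL Server'),
# so it is counted under both and appears as two rows with identical counts.
databases = ['Cassandra', 'Couchbase', 'DynamoDB', 'Elasticsearch', 'Firebase',
             'IBM DB2', 'MariaDB', 'Microsoft', 'SQL Server', 'MongoDB', 'MySQL',
             'Oracle', 'Redis', 'PostgreSQL', 'SQLite']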

In [12]:
DatabaseWorkedWith = clean_counts(df,'DatabaseWorkedWith', databases, 'databases')
DatabaseWorkedWith

Unnamed: 0,databases,Number_of_Developers
0,Cassandra,1654
1,Couchbase,937
2,DynamoDB,3497
3,Elasticsearch,6817
4,Firebase,7128
5,IBM DB2,1421
6,MariaDB,8312
7,Microsoft,16336
8,SQL Server,16336
9,MongoDB,13086


In [13]:
DatabaseDesireNextYear = clean_counts(df,'DatabaseDesireNextYear', databases, 'databases')
DatabaseDesireNextYear

Unnamed: 0,databases,Number_of_Developers
0,Cassandra,4227
1,Couchbase,1678
2,DynamoDB,4773
3,Elasticsearch,10269
4,Firebase,8600
5,IBM DB2,935
6,MariaDB,6126
7,Microsoft,9876
8,SQL Server,9876
9,MongoDB,16024


In [14]:
create_plot(DatabaseWorkedWith, DatabaseDesireNextYear, 'databases')

In [15]:
diff_plot(DatabaseWorkedWith, DatabaseDesireNextYear, 'databases')

- According to the survey, the most popular database by far is MySQL, followed by Microsoft SQL Server (shown as two rows, Microsoft and SQL Server, because its name is split into two entries in `databases`).

- Notably, MySQL also has the largest negative differential between the number of developers who want to work with it in 2021 and the number who worked with it in 2020.

- MongoDB and Elasticsearch have the largest positive differentials between the number of developers who want to work with them in 2021 and the number who worked with them in 2020.

# What are the most used platforms?

In [21]:
df['PlatformDesireNextYear']

0           Android;iOS;Kubernetes;Microsoft Azure;Windows
1                               iOS;Kubernetes;Linux;MacOS
2                                                      NaN
3                                                      NaN
4        Docker;Google Cloud Platform;Heroku;Linux;Windows
                               ...                        
64456                                                  NaN
64457                                                  NaN
64458                                                  NaN
64459                                                  NaN
64460                   Arduino;Linux;Raspberry Pi;Windows
Name: PlatformDesireNextYear, Length: 64461, dtype: object

### As we can see, each response is a semicolon-separated string, so we need to clean this column before we can count anything
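
One possible way to do this kind of cleaning, shown here only as a sketch on a made-up `sample` series rather than the notebook's own approach, is pandas' `Series.str.get_dummies`, which turns a delimited string column into one indicator column per distinct answer.

```python
import pandas as pd

# Hypothetical sample shaped like df['PlatformDesireNextYear'].
sample = pd.Series([
    "Android;iOS;Kubernetes",
    "iOS;Linux",
    None,
    "Docker;Linux;Windows",
])

# One 0/1 column per distinct answer; a missing response becomes a row of zeros.
indicators = sample.str.get_dummies(sep=";")

# Column sums give the number of respondents who selected each platform.
counts = indicators.sum().sort_values(ascending=False)
print(counts)
```

Because the column names come from the data itself, this does not require a hand-maintained list of platform names.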

In [22]:
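# Note: 'IBM Cloud or Watson iOS' and 'Slack Apps and Integrations, Windows' each fuse two
# survey options into one string, so they match no response and are missing from the counts.
platforms = ['Android','Arduino', 'AWS', 'Docker', 'Google Cloud Platform', 'Heroku', 
             'IBM Cloud or Watson iOS', 'Kubernetes', 'Linux', 'MacOS', 'Microsoft Azure', 
             'Raspberry Pi', 'Slack Apps and Integrations, Windows', 'WordPress']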
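
A hand-typed list like the one above can be checked against the answers that actually occur in the column. The sketch below uses a made-up `sample` series and a hypothetical `candidates` list (neither is taken from the survey file) to show how entries that match nothing, and answers that are not covered, could be surfaced.

```python
import pandas as pd

# Hypothetical sample in the survey's semicolon-delimited format.
sample = pd.Series(["Android;iOS;Windows", "IBM Cloud or Watson;Linux", None])

# Hypothetical list with one entry that accidentally merges two options.
candidates = ["Android", "IBM Cloud or Watson iOS", "Linux"]

# Every distinct answer that actually occurs in the column.
observed = set(sample.dropna().str.split(";").explode())

# List entries that never occur, and observed answers the list does not cover.
unmatched = [name for name in candidates if name not in observed]
missing = sorted(observed - set(candidates))

print(unmatched)  # ['IBM Cloud or Watson iOS']
print(missing)    # ['IBM Cloud or Watson', 'Windows', 'iOS']
```

An empty `unmatched` list is a quick sanity check that every name will contribute to the counts.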

In [23]:
PlatformWorkedWith = clean_counts(df,'PlatformWorkedWith', platforms, 'platforms')
PlatformWorkedWith

Unnamed: 0,platforms,Number_of_Developers
0,Android,14101
1,Arduino,5712
2,AWS,14389
3,Docker,18851
4,Google Cloud Platform,7569
5,Heroku,5974
6,Kubernetes,6178
7,Linux,29600
8,MacOS,12898
9,Microsoft Azure,7830


In [24]:
PlatformDesireNextYear = clean_counts(df,'PlatformDesireNextYear', platforms, 'platforms')
PlatformDesireNextYear

Unnamed: 0,platforms,Number_of_Developers
0,Android,15085
1,Arduino,6895
2,AWS,18381
3,Docker,23458
4,Google Cloud Platform,11648
5,Heroku,5071
6,Kubernetes,14009
7,Linux,27475
8,MacOS,11793
9,Microsoft Azure,9816


In [25]:
create_plot(PlatformWorkedWith, PlatformDesireNextYear, 'platforms')

In [26]:
diff_plot(PlatformWorkedWith, PlatformDesireNextYear, 'platforms')

- This is an interesting category because it contains many different types of platform that cannot really be compared. For instance, it doesn't make much sense to compare Linux to AWS, or Docker to Raspberry Pi. Nevertheless, we can still obtain interesting insights from the data.

- Kubernetes is a highly desired platform: more than twice as many developers want to work with it in 2021 (14009) as worked with it in 2020 (6178).
- Although AWS is the most popular cloud service by far, AWS, GCP and Azure all have a positive differential between the number of developers who want to work with them next year and the number who worked with them in 2020.

In [27]:
web_frameworks = ['Angular', 'Angular.js', 'ASP.NET', 'ASP.NET Core', 'Django', 'Drupal', 
                  'Express', 'Flask', 'Gatsby', 'jQuery', 'Laravel', 'React.js', 'Ruby on Rails', 
                  'Spring', 'Symfony', 'Vue.js']

In [31]:
WebframeWorkedWith=clean_counts(df,'WebframeWorkedWith', web_frameworks, 'web_frameworks')

In [32]:
WebframeDesireNextYear = clean_counts(df,'WebframeDesireNextYear', web_frameworks, 'web_frameworks')

In [35]:
create_plot(WebframeWorkedWith, WebframeDesireNextYear, 'web_frameworks')

In [36]:
diff_plot(WebframeWorkedWith, WebframeDesireNextYear, 'web_frameworks')

In [30]:
# Multi-word names are kept as single entries so each technology is counted once.
other_framworks = ['.NET', '.NET Core', 'Ansible', 'Apache Spark', 'Chef',
                 'Cordova', 'Flutter', 'Hadoop', 'Keras', 'Node.js', 'Pandas',
                 'Puppet', 'React Native', 'TensorFlow', 'Terraform', 'Torch/PyTorch',
                 'Unity 3D', 'Unreal Engine', 'Xamarin']

In [39]:
MiscTechWorkedWith = clean_counts(df,'MiscTechWorkedWith', other_framworks, 'framework')

In [40]:
MiscTechDesireNextYear = clean_counts(df,'MiscTechDesireNextYear', other_framworks, 'framework')

In [41]:
create_plot(MiscTechWorkedWith, MiscTechDesireNextYear, 'framework')

In [42]:
diff_plot(MiscTechWorkedWith, MiscTechDesireNextYear, 'framework')
