### Paths

In [1]:
src_path = "../src" #from ./code
out_path = "../out" #from ./code
yearf_pattern = "/stack-overflow-developer-survey-20"
csv_pattern = "/survey_results_public.csv"
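
The strings above are concatenated later in the notebook to locate each year's survey file. As a minimal sketch of that concatenation, the hypothetical helper `build_csv_path` below (it is not part of the notebook) makes the resulting path explicit:

```python
# Values copied from the cell above.
src_path = "../src"
yearf_pattern = "/stack-overflow-developer-survey-20"
csv_pattern = "/survey_results_public.csv"


def build_csv_path(year: str) -> str:
    """Hypothetical helper: join the path pieces for a two-digit year."""
    return src_path + yearf_pattern + year + csv_pattern


print(build_csv_path("19"))
# ../src/stack-overflow-developer-survey-2019/survey_results_public.csv
```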

### Imports

In [2]:
import pandas as pd

## 2019

Survey period: January 23 to February 14, 2019.

In [3]:
year = "19"
filtered_csv = "/survey_results_20" + year + "_filtered.csv"

In [4]:
df = pd.read_csv(src_path + yearf_pattern + year + csv_pattern)

### Print All Columns

In [17]:
column_headers = df.columns

print("Columns in the source .csv file:", column_headers)

del column_headers

Columns in the source .csv file: Index(['Respondent', 'MainBranch', 'Hobbyist', 'OpenSourcer', 'OpenSource',
       'Employment', 'Country', 'Student', 'EdLevel', 'UndergradMajor',
       'EduOther', 'OrgSize', 'DevType', 'YearsCode', 'Age1stCode',
       'YearsCodePro', 'CareerSat', 'JobSat', 'MgrIdiot', 'MgrMoney',
       'MgrWant', 'JobSeek', 'LastHireDate', 'LastInt', 'FizzBuzz',
       'JobFactors', 'ResumeUpdate', 'CurrencySymbol', 'CurrencyDesc',
       'CompTotal', 'CompFreq', 'ConvertedComp', 'WorkWeekHrs', 'WorkPlan',
       'WorkChallenge', 'WorkRemote', 'WorkLoc', 'ImpSyn', 'CodeRev',
       'CodeRevHrs', 'UnitTests', 'PurchaseHow', 'PurchaseWhat',
       'LanguageWorkedWith', 'LanguageDesireNextYear', 'DatabaseWorkedWith',
       'DatabaseDesireNextYear', 'PlatformWorkedWith',
       'PlatformDesireNextYear', 'WebFrameWorkedWith',
       'WebFrameDesireNextYear', 'MiscTechWorkedWith',
       'MiscTechDesireNextYear', 'DevEnviron', 'OpSys', 'Containers',
       'BlockchainOr

### Filter rows (Brazil)

In [18]:
brazil_rows = df[df['Country'] == 'Brazil']
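
The filter above is a boolean mask: the comparison yields one True/False value per row, and indexing with it keeps only the True rows. The sketch below shows the same pattern on toy data (only the column names come from the notebook). The `.copy()` call is an optional addition, not something the notebook does, that makes the result an independent DataFrame.

```python
import pandas as pd

# Toy data standing in for the survey file.
toy = pd.DataFrame({
    "Respondent": [1, 2, 3, 4],
    "Country": ["Brazil", "Germany", "Brazil", None],
})

# One boolean per row; a missing country compares as False.
mask = toy["Country"] == "Brazil"

# Keep the matching rows as an independent DataFrame.
brazil_toy = toy[mask].copy()

print(mask.tolist())
print(brazil_toy["Respondent"].tolist())
print(brazil_toy.index.tolist())
```

The selected rows keep their original row labels, which is why the `head(10)` output in this notebook shows labels such as 18 and 86 rather than 0 and 1.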

In [19]:
print(brazil_rows.head(10))

     Respondent                                         MainBranch Hobbyist  \
18           19                     I am a developer by profession      Yes   
86           87                     I am a developer by profession      Yes   
129         130                     I am a developer by profession      Yes   
343         345                     I am a developer by profession      Yes   
348         350                     I am a developer by profession      Yes   
371         373                     I am a developer by profession      Yes   
409         411                     I am a developer by profession      Yes   
437         439                     I am a developer by profession      Yes   
569         572                     I am a developer by profession      Yes   
650         653  I am not primarily a developer, but I write co...       No   

                                           OpenSourcer  \
18                                               Never   
86            

### Selected columns

The schema describing the columns is in the `survey_results_schema.csv` file in each year's folder. It is an English-language file that explains what each column represents.
Let's choose which columns to use in the study:
- `Respondent` - Randomized ID of the survey respondent.
- `MainBranch` - How the respondent describes themselves, e.g. developer.
- `Hobbyist` - Whether the respondent considers themselves a hobbyist.
- `Employment` - Employment status.
- `Country` - (NOT INCLUDED) Country. Not needed, since the rows were already filtered by Brazil, i.e. `df['Country'] == 'Brazil'`.
- `EdLevel` - Education level.
- `OrgSize` - Size of the respondent's company.
- `DevType` - Type of developer the respondent identifies as.
- `YearsCode` - (NOT INCLUDED) Years of coding experience. (Probably not needed.)
- `JobSat` - Job satisfaction level.
- `JobSeek` - Job-seeking status.
- `CurrencySymbol` - Currency the respondent uses.
- `CompTotal` - Total compensation (salary, bonuses, and perks, before taxes and deductions).
- `CompFreq` - How often the respondent receives `CompTotal`.
- `ConvertedComp` - `CompTotal` per `CompFreq` converted to annual USD (2019-01-02), assuming 12 working months or 50 working weeks per year.
- `WorkWeekHrs` - Hours worked per week.
- `PurchaseHow` - "How does your company make decisions about purchasing new technology (cloud, AI, IoT, databases)?"
- `PurchaseWhat` - "What level of influence do you, personally, have over new technology purchases at your organization?"
- `LanguageWorkedWith` - Programming languages used in the past year.
- `LanguageDesireNextYear` - Languages the respondent wants to work with next year.
- `DatabaseWorkedWith` - Databases used in the past year.
- `DatabaseDesireNextYear` - Databases the respondent wants to work with next year.
- `PlatformWorkedWith` - Platforms used in the past year.
- `PlatformDesireNextYear` - Platforms the respondent wants to work with next year.
- `WebFrameWorkedWith` - Web frameworks used in the past year.
- `WebFrameDesireNextYear` - Web frameworks the respondent wants to work with next year.
- `DevEnviron` - Development environments used regularly.
- `OpSys` - Operating system used.
- `Age` - Age.
- `Gender` - Gender.
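
The `ConvertedComp` entry above implies a small calculation: scale `CompTotal` to a full year (12 months or 50 weeks) and then convert it to USD. The sketch below only illustrates that arithmetic. The helper `annual_usd`, the frequency labels, and the exchange rate are assumptions for illustration, not values taken from the survey schema.

```python
# Periods per year, following the 12-month / 50-week assumption stated above.
# The frequency labels are assumed; check them against the real CompFreq values.
PERIODS_PER_YEAR = {"Weekly": 50, "Monthly": 12, "Yearly": 1}


def annual_usd(comp_total: float, comp_freq: str, usd_per_unit: float) -> float:
    """Hypothetical helper: annualize a payment and convert it to USD."""
    return comp_total * PERIODS_PER_YEAR[comp_freq] * usd_per_unit


# Illustrative numbers only: 5,000 per month at a made-up rate of 0.25 USD per unit.
print(annual_usd(5000, "Monthly", 0.25))  # 15000.0
```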

In [8]:
selected_columns = [
    "Respondent", "MainBranch", "Hobbyist", "OpenSourcer", "Employment", "EdLevel", "OrgSize", "YearsCodePro", "YearsCode", "DevType",
    "ConvertedComp", "LanguageWorkedWith", "LanguageDesireNextYear", "DatabaseWorkedWith", "DatabaseDesireNextYear", "PlatformWorkedWith", "PlatformDesireNextYear",
    "WebFrameWorkedWith", "WebFrameDesireNextYear", "DevEnviron", "OpSys", "Age", "Gender",
]
print("Number of selected columns:", len(selected_columns))

Number of selected columns: 23


### Generating the .csv file with the selected columns

Some columns need to be renamed so that they follow the naming convention of the three most recent datasets.

In [21]:
column_rename_mapping = {
    "Respondent" : "ResponseID",
    "ConvertedComp" : "ConvertedCompYearly",
    "LanguageWorkedWith" : "LanguageHaveWorkedWith",
    "LanguageDesireNextYear" : "LanguageWantToWorkWith",
    "DatabaseWorkedWith" : "DatabaseHaveWorkedWith",
    "DatabaseDesireNextYear" : "DatabaseWantToWorkWith",
    "PlatformWorkedWith" : "PlatformHaveWorkedWith",
    "PlatformDesireNextYear" : "PlatformWantToWorkWith",
    "WebFrameWorkedWith" : "WebframeHaveWorkedWith",
    "WebFrameDesireNextYear" : "WebframeWantToWorkWith",
    "DevEnviron" : "NEWCollabToolsHaveWorkedWith",
}

In [22]:
filtered_data = brazil_rows[selected_columns].rename(columns=column_rename_mapping)
numero_de_linhas = len(filtered_data)
print("Number of rows:", numero_de_linhas)
print("columns\n" + ''.join(['- {}\n'.format(y) for y in filtered_data.columns]))

Number of rows: 1948
columns
- ResponseID
- MainBranch
- Hobbyist
- OpenSourcer
- Employment
- EdLevel
- OrgSize
- YearsCodePro
- YearsCode
- DevType
- ConvertedCompYearly
- LanguageHaveWorkedWith
- LanguageWantToWorkWith
- DatabaseHaveWorkedWith
- DatabaseWantToWorkWith
- PlatformHaveWorkedWith
- PlatformWantToWorkWith
- WebframeHaveWorkedWith
- WebframeWantToWorkWith
- NEWCollabToolsHaveWorkedWith
- OpSys
- Age
- Gender
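
Because the goal of the renaming is to match the newer datasets, it can help to compare the renamed headers against the target names. The sketch below is one possible check, not part of the notebook. It uses a handful of names copied from this notebook's outputs (the renamed columns above and the header intersection of the 2021-2023 files printed in the 2020 section) and reports names that differ only in capitalization.

```python
# A few target names copied from the 2021-2023 header intersection (not the full list).
reference_columns = {
    "ResponseId", "ConvertedCompYearly",
    "LanguageHaveWorkedWith", "WebframeHaveWorkedWith",
}

# A few renamed names copied from the output above.
renamed_columns = [
    "ResponseID", "ConvertedCompYearly",
    "LanguageHaveWorkedWith", "WebframeHaveWorkedWith",
]

# Map lowercase reference names back to their exact spelling.
lowercase_reference = {name.lower(): name for name in reference_columns}

# Names absent from the reference set that match once case is ignored.
case_mismatches = [
    (name, lowercase_reference[name.lower()])
    for name in renamed_columns
    if name not in reference_columns and name.lower() in lowercase_reference
]

print(case_mismatches)  # [('ResponseID', 'ResponseId')]
```

A mismatch like this matters when the yearly files are later concatenated, since pandas treats `ResponseID` and `ResponseId` as two different columns.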



In [1]:
coding_activities_values = []

# Iterate through the rows of the filtered DataFrame
for index, row in filtered_data.iterrows():
    coding_activities = ""

    # If "Hobbyist" is "Yes", add "Hobby" to the coding activities
    if row["Hobbyist"] == "Yes":
        coding_activities += "Hobby;"

    # If "OpenSourcer" is filled in and is not "Never", add open-source contribution.
    # pandas reads empty fields as NaN (not ""), so test for them with pd.notna
    if pd.notna(row["OpenSourcer"]) and row["OpenSourcer"] not in ("Never", ""):
        coding_activities += "Contribute to open-source projects;"

    # Drop the trailing separator
    if coding_activities.endswith(";"):
        coding_activities = coding_activities[:-1]

    # Append the computed coding activities value to the list
    coding_activities_values.append(coding_activities)

NameError: name 'filtered_data' is not defined

In [12]:
filtered_data["CodingActivities"] = coding_activities_values
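
The same `CodingActivities` column can also be built without `iterrows`, which tends to be slow on large frames. The sketch below is an alternative on toy data, not the notebook's method: it computes two boolean Series and joins the matching labels per row. The sample `OpenSourcer` answers are illustrative.

```python
import pandas as pd

# Toy rows; the third one has a missing OpenSourcer answer.
toy = pd.DataFrame({
    "Hobbyist": ["Yes", "No", "Yes", "No"],
    "OpenSourcer": ["Never", "Once a month or more often", None, "Never"],
})

is_hobbyist = toy["Hobbyist"].eq("Yes")
# Count a respondent as a contributor only when an answer exists and is not "Never".
contributes = toy["OpenSourcer"].notna() & toy["OpenSourcer"].ne("Never")

toy["CodingActivities"] = [
    ";".join(
        label
        for label, flag in (
            ("Hobby", hobby),
            ("Contribute to open-source projects", contrib),
        )
        if flag
    )
    for hobby, contrib in zip(is_hobbyist, contributes)
]

print(toy["CodingActivities"].tolist())
```

Joining a list of labels avoids the trailing-semicolon cleanup that string concatenation needs.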

In [13]:
filtered_data.drop(columns=['Hobbyist', 'OpenSourcer'], inplace=True)

In [14]:
print("columns\n" + ''.join(['- {}\n'.format(y) for y in filtered_data.columns]))

columns
- ResponseID
- MainBranch
- Employment
- EdLevel
- OrgSize
- YearsCodePro
- YearsCode
- DevType
- ConvertedCompYearly
- LanguageHaveWorkedWith
- LanguageWantToWorkWith
- DatabaseHaveWorkedWith
- DatabaseWantToWorkWith
- PlatformHaveWorkedWith
- PlatformWantToWorkWith
- WebframeHaveWorkedWith
- WebframeWantToWorkWith
- NEWCollabToolsHaveWorkedWith
- OpSys
- Age
- Gender
- CodingActivities



In [15]:
filtered_data.to_csv((out_path + filtered_csv), index=False)

### Note
Rows that contain empty fields were kept.
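
Since incomplete rows are kept, it can be useful to know how many gaps the exported file carries. The sketch below shows one way to count them with `isna()`. The data is a toy stand-in with invented values; only the column names follow the notebook.

```python
import pandas as pd

# Toy stand-in for the filtered data; None marks an empty field.
toy = pd.DataFrame({
    "ResponseID": [19, 87, 130],
    "OrgSize": ["20 to 99 employees", None, "10,000 or more employees"],
    "ConvertedCompYearly": [None, 12000.0, 30000.0],
})

# Number of empty cells in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Number of rows that have at least one empty cell.
rows_with_gaps = int(toy.isna().any(axis=1).sum())
print("Rows with at least one empty field:", rows_with_gaps)
```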

## 2020

Survey period: February 5 to February 28, 2020.

Salaries were converted to USD using the exchange rate on 2020-02-19.

In [16]:
year = "20"
filtered_csv = "/survey_results_20" + year + "_filtered.csv"

In [17]:
# Headers of every CSV file, in the order the files are read
all_headers = []

# Columns present in every file (None until the first file is read)
common_columns = None

# Files to compare; 2019 and 2020 are left out
csv_files = [ # src_path + yearf_pattern + "19" + csv_pattern,
             # src_path + yearf_pattern + "20" + csv_pattern,
             src_path + yearf_pattern + "21" + csv_pattern,
             src_path + yearf_pattern + "22" + csv_pattern,
             src_path + yearf_pattern + "23" + csv_pattern,
            ]

for file in csv_files:
    # Load the CSV file
    df = pd.read_csv(file)

    # Record the headers of the current CSV file
    headers = list(df.columns)
    all_headers.append(headers)

    if common_columns is None:
        # First file: start from its full set of headers
        common_columns = set(headers)
    else:
        # Later files: keep only the headers seen in every file so far
        common_columns = common_columns.intersection(headers)

# Convert the set of common columns back to a list (set order is arbitrary)
common_columns = list(common_columns)

print("\nCommon columns in all CSV files:", common_columns)
print("Length of columns", len(common_columns))


Common columns in all CSV files: ['EdLevel', 'SurveyLength', 'SOVisitFreq', 'DatabaseWantToWorkWith', 'YearsCodePro', 'LearnCode', 'Country', 'ResponseId', 'OrgSize', 'SurveyEase', 'LanguageWantToWorkWith', 'SOComm', 'PlatformHaveWorkedWith', 'MiscTechWantToWorkWith', 'PlatformWantToWorkWith', 'Currency', 'SOPartFreq', 'DevType', 'SOAccount', 'WebframeWantToWorkWith', 'LanguageHaveWorkedWith', 'WebframeHaveWorkedWith', 'NEWCollabToolsWantToWorkWith', 'MiscTechHaveWorkedWith', 'CompTotal', 'Employment', 'DatabaseHaveWorkedWith', 'ToolsTechHaveWorkedWith', 'Age', 'NEWCollabToolsHaveWorkedWith', 'MainBranch', 'YearsCode', 'ConvertedCompYearly', 'NEWSOSites', 'ToolsTechWantToWorkWith']
Length of columns 35
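
Only the header row is needed to compute this intersection. As a lighter variant, the sketch below reads each file with `nrows=0`, so pandas parses the header line without loading any data rows, and sorts the result for a stable order. The helper `common_headers` and the two throwaway files are illustrative, not part of the notebook.

```python
import os
import tempfile

import pandas as pd


def common_headers(paths):
    """Return the column names shared by every CSV in `paths`, sorted."""
    common = None
    for path in paths:
        # nrows=0 parses only the header line, so no data rows are loaded.
        headers = set(pd.read_csv(path, nrows=0).columns)
        common = headers if common is None else common & headers
    return sorted(common or [])


# Demo on two temporary files whose column names mimic the survey headers.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "a.csv")
    second = os.path.join(tmp, "b.csv")
    with open(first, "w") as fh:
        fh.write("ResponseId,Country,Age\n1,Brazil,25-34 years old\n")
    with open(second, "w") as fh:
        fh.write("ResponseId,Country,EdLevel\n2,Brazil,Bachelor\n")
    shared = common_headers([first, second])

print(shared)  # ['Country', 'ResponseId']
```

Unlike the cell above, this variant does not reassign `df`, so a DataFrame loaded earlier stays available.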
