# Getting Started

In [67]:
import pandas as pd

In [68]:
df = pd.read_csv('data/survey_results_public.csv')

In [69]:
df.shape

(88883, 85)

In [70]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88883 entries, 0 to 88882
Data columns (total 85 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Respondent              88883 non-null  int64  
 1   MainBranch              88331 non-null  object 
 2   Hobbyist                88883 non-null  object 
 3   OpenSourcer             88883 non-null  object 
 4   OpenSource              86842 non-null  object 
 5   Employment              87181 non-null  object 
 6   Country                 88751 non-null  object 
 7   Student                 87014 non-null  object 
 8   EdLevel                 86390 non-null  object 
 9   UndergradMajor          75614 non-null  object 
 10  EduOther                84260 non-null  object 
 11  OrgSize                 71791 non-null  object 
 12  DevType                 81335 non-null  object 
 13  YearsCode               87938 non-null  object 
 14  Age1stCode              87634 non-null

In [71]:
# pd.set_option('display.max_columns', df.shape[1])  # configure output to show all columns
pd.set_option('display.max_columns', 5)

Now we can load the schema CSV file too, to find out what each column in the survey results means.

In [72]:
schema = pd.read_csv('data/survey_results_schema.csv')
# pd.set_option('display.max_rows', schema.shape[0])   # would allow us to see all 85 schema rows
pd.set_option('display.max_rows', 8)

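Note that pandas would normally truncate the 85-row schema when displaying it. The `display.max_rows` setting above controls this: here it is set to 8 to keep the output short, and the commented-out line would raise it enough to show every row.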
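
If you would rather not change the display settings for the whole notebook, `pd.option_context` applies a setting only inside a `with` block. A minimal sketch, using a small made-up frame (`toy` is an illustrative name, not part of the survey data):

```python
import pandas as pd

# Hypothetical stand-in for a frame with more rows than we want to print.
toy = pd.DataFrame({"n": range(20)})

before = pd.get_option("display.max_rows")

# The setting applies only inside the with block.
with pd.option_context("display.max_rows", 5):
    inside = pd.get_option("display.max_rows")
    print(toy)  # displayed in truncated form

# On leaving the block, the earlier value is restored.
after = pd.get_option("display.max_rows")
```

`pd.reset_option("display.max_rows")` is another way to get back to the library default after calling `pd.set_option`.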

Use `.head()` or `.tail()` to show only the first or last few rows when you don't want the full output.

In [73]:
schema.head(3)

|   | Column | QuestionText |
|---|---|---|
| 0 | Respondent | Randomized respondent ID number (not in order ... |
| 1 | MainBranch | Which of the following options best describes ... |
| 2 | Hobbyist | Do you code as a hobby? |


# DataFrame and Series Basics

Accessing a DataFrame with square brackets is a bit like accessing a dictionary of lists. The keys are like column labels, and each list is like the Series that holds that column's values.

In [74]:
people = {
    "first": ["Corey", "Jane"],
    "last": ["Schafer", "Doe"]
}

people["first"]

['Corey', 'Jane']

In [75]:
pd.DataFrame(people)

|   | first | last |
|---|---|---|
| 0 | Corey | Schafer |
| 1 | Jane | Doe |


See the `Snippets.ipynb` notebook for code showing how to access rows and columns.
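
As a quick reminder of the main access patterns, here is a sketch on the toy `people` data. It is not a copy of `Snippets.ipynb`, and the name `people_df` is introduced here purely for illustration:

```python
import pandas as pd

people = {
    "first": ["Corey", "Jane"],
    "last": ["Schafer", "Doe"],
}
people_df = pd.DataFrame(people)

first_col = people_df["first"]            # one label -> a Series
both_cols = people_df[["last", "first"]]  # list of labels -> a DataFrame
row_by_position = people_df.iloc[0]       # first row, by integer position
row_by_label = people_df.loc[1]           # row whose index label is 1
one_value = people_df.loc[1, "last"]      # row label and column label -> one cell
```

With the default `RangeIndex`, labels and positions happen to coincide, so `.loc[1]` and `.iloc[1]` return the same row. They stop agreeing as soon as the index is set to something else.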

Let's look at what we can do with the main data set.

In [76]:
df['Hobbyist']

0        Yes
1         No
2        Yes
3         No
        ... 
88879     No
88880     No
88881     No
88882    Yes
Name: Hobbyist, Length: 88883, dtype: object

In [77]:
df.loc[0, 'Hobbyist']  # first row, Hobbyist column

'Yes'

In [78]:
df.loc[0:2, 'Hobbyist':'Employment']  # label slices include both endpoints

|   | Hobbyist | OpenSourcer | OpenSource | Employment |
|---|---|---|---|---|
| 0 | Yes | Never | The quality of OSS and closed source software ... | Not employed, and not looking for work |
| 1 | No | Less than once per year | The quality of OSS and closed source software ... | Not employed, but looking for work |
| 2 | Yes | Never | The quality of OSS and closed source software ... | Employed full-time |
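
Notice that the output above has three rows, not two: slicing with `.loc` works on labels and includes the end label, while `.iloc` works on positions and excludes the end position, like ordinary Python slicing. A small sketch on a made-up frame (`toy` and its columns are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "c": [9, 10, 11, 12]}
)

# Label-based: rows labelled 0, 1 and 2; columns from "a" through "b".
by_label = toy.loc[0:2, "a":"b"]

# Position-based: rows at positions 0 and 1 only; columns at positions 0 and 1.
by_position = toy.iloc[0:2, 0:2]

print(by_label.shape)
print(by_position.shape)
```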


# Indexes: setting, resetting and using

In [79]:
df.set_index('Respondent', inplace=True)  # could also use index_col in read_csv

How do we find the question for a column, given that we have the schema frame?

In [80]:
schema.head(3)

|   | Column | QuestionText |
|---|---|---|
| 0 | Respondent | Randomized respondent ID number (not in order ... |
| 1 | MainBranch | Which of the following options best describes ... |
| 2 | Hobbyist | Do you code as a hobby? |


In [81]:
schema.set_index('Column', inplace=True)

In [82]:
schema.loc['MgrIdiot']

QuestionText    How confident are you that your manager knows ...
Name: MgrIdiot, dtype: object
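
The Series display above still cuts the question off. One way to read the whole text is to pass both the row label and the column label to `.loc`, which returns the cell itself as a plain string. A sketch on a made-up two-row schema (`toy_schema` and the `ExampleColumn` row are hypothetical):

```python
import pandas as pd

# Made-up stand-in for the schema frame; the real one has 85 rows.
toy_schema = pd.DataFrame(
    {
        "Column": ["Hobbyist", "ExampleColumn"],
        "QuestionText": [
            "Do you code as a hobby?",
            "A placeholder question that is long enough to be cut off when displayed in a table",
        ],
    }
).set_index("Column")

# Row label plus column label -> a single value, printed in full.
question = toy_schema.loc["ExampleColumn", "QuestionText"]
print(question)
```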

In [83]:
schema.sort_index().head(3)

| Column | QuestionText |
|---|---|
| Age | What is your age (in years)? If you prefer not... |
| Age1stCode | At what age did you write your first line of c... |
| BetterLife | Do you think people born today will have a bet... |
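
The remaining index operations mentioned in this section can be sketched on a tiny in-memory CSV: setting the index while reading with `index_col`, sorting by it, and undoing it with `reset_index`. The CSV text and the name `toy_schema` are made up for illustration:

```python
import io

import pandas as pd

# Made-up CSV content standing in for the schema file.
csv_text = (
    "Column,QuestionText\n"
    "Hobbyist,Do you code as a hobby?\n"
    "Age,Placeholder question text\n"
)

# index_col sets the index at read time, instead of calling set_index afterwards.
toy_schema = pd.read_csv(io.StringIO(csv_text), index_col="Column")

# sort_index returns a sorted copy; toy_schema keeps its original order.
sorted_schema = toy_schema.sort_index()

# reset_index moves the index back into an ordinary column
# and restores the default 0..n-1 index.
restored = toy_schema.reset_index()
```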


# Filtering

Who makes loads of money writing Python, in one of the countries we're most interested in?

In [84]:
countries = ['United States', 'India', 'United Kingdom', 'Germany', 'Canada']
in_country = df['Country'].isin(countries)

In [85]:
does_python = df['LanguageWorkedWith'].str.contains('Python', na=False)

In [86]:
high_salary = (df['ConvertedComp'] > 70000)

In [87]:
df[high_salary & does_python & in_country].head(3)

| Respondent | MainBranch | Hobbyist | ... | SurveyLength | SurveyEase |
|---|---|---|---|---|---|
| 22 | I am a developer by profession | Yes | ... | Appropriate in length | Easy |
| 26 | I am a developer by profession | Yes | ... | Appropriate in length | Easy |
| 32 | I am a developer by profession | No | ... | Appropriate in length | Neither easy nor difficult |
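
To see how the three masks behave, especially around missing answers, here is a sketch on a made-up four-row frame (`toy` and its values are hypothetical). Boolean masks combine with `&` (and), `|` (or) and `~` (not), and `na=False` makes `str.contains` return `False` for missing values so the mask stays purely boolean:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "Country": ["India", "Canada", "France", "India"],
        "LanguageWorkedWith": ["Python;SQL", None, "Python", "Java"],
        "ConvertedComp": [80000.0, 90000.0, 75000.0, np.nan],
    }
)

in_country = toy["Country"].isin(["India", "Canada"])
# Missing language answers become False rather than a missing value.
does_python = toy["LanguageWorkedWith"].str.contains("Python", na=False)
# A comparison against NaN evaluates to False.
high_salary = toy["ConvertedComp"] > 70000

all_three = toy[high_salary & does_python & in_country]  # every condition holds
either = toy[does_python | in_country]                   # at least one holds
not_python = toy[~does_python]                           # condition negated
```

When a comparison is written inline rather than stored in a variable, it needs its own parentheses, because `&` and `|` bind more tightly than `>` or `==`.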


In [88]:
df.loc[high_salary, ['Country', 'LanguageWorkedWith', 'ConvertedComp']]

| Respondent | Country | LanguageWorkedWith | ConvertedComp |
|---|---|---|---|
| 6 | Canada | Java;R;SQL | 366420.0 |
| 9 | New Zealand | Bash/Shell/PowerShell;C#;HTML/CSS;JavaScript;P... | 95179.0 |
| 13 | United States | Bash/Shell/PowerShell;HTML/CSS;JavaScript;PHP;... | 90000.0 |
| 16 | United Kingdom | Bash/Shell/PowerShell;C#;HTML/CSS;JavaScript;T... | 455352.0 |
| ... | ... | ... | ... |
| 88877 | United States | Bash/Shell/PowerShell;C;Clojure;HTML/CSS;Java;... | 2000000.0 |
| 88878 | United States | HTML/CSS;JavaScript;Scala;TypeScript | 130000.0 |
| 88879 | Finland | Bash/Shell/PowerShell;C++;Python | 82488.0 |
| 88882 | Netherlands | C#;HTML/CSS;Java;JavaScript;PHP;Python | 588012.0 |


# Sorting data

Imagine we want to look at survey results by country and salary.

In [91]:
df.sort_values(by='Country', inplace=True)

In [95]:
df[['Country', 'ConvertedComp']].head(5)

| Respondent | Country | ConvertedComp |
|---|---|---|
| 39258 | Afghanistan | 19152.0 |
| 87091 | Afghanistan | NaN |
| 58760 | Afghanistan | NaN |
| 74386 | Afghanistan | NaN |
| 29045 | Afghanistan | NaN |


In [99]:
df.sort_values(
    by=['Country', 'ConvertedComp'],
    ascending=[True, False],
    inplace=True
)

In [100]:
df[['Country', 'ConvertedComp']].head(5)

| Respondent | Country | ConvertedComp |
|---|---|---|
| 63129 | Afghanistan | 1000000.0 |
| 50499 | Afghanistan | 153216.0 |
| 39258 | Afghanistan | 19152.0 |
| 58450 | Afghanistan | 17556.0 |
| 7085 | Afghanistan | 14364.0 |
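
The same two-key sort can be tried on a made-up frame to see where missing salaries end up (`toy` and its values are hypothetical). This sketch also leaves out `inplace=True`, so `sort_values` returns a new frame and the original keeps its order:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "Country": ["Brazil", "Austria", "Brazil", "Austria"],
        "ConvertedComp": [50000.0, np.nan, 90000.0, 30000.0],
    },
    index=[101, 102, 103, 104],
)

# Country ascending, then salary descending within each country.
# Missing salaries go last within their country (na_position defaults to "last").
by_country_then_pay = toy.sort_values(
    by=["Country", "ConvertedComp"],
    ascending=[True, False],
)

print(by_country_then_pay)
```

Because missing values are placed last by default, the respondents with no salary drop to the bottom of each country's block even though the salary column is sorted in descending order.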


Who are the top 10 most highly paid respondents in Afghanistan, and what do they do?

In [126]:
in_afghanistan = df['Country'] == 'Afghanistan'
df[in_afghanistan].nlargest(10, 'ConvertedComp')[['ConvertedComp', 'Employment', 'EdLevel', 'Student']]

| Respondent | ConvertedComp | Employment | EdLevel | Student |
|---|---|---|---|---|
| 63129 | 1000000.0 | Employed full-time | I never completed any formal education | Yes, full-time |
| 50499 | 153216.0 | Employed full-time | Bachelor’s degree (BA, BS, B.Eng., etc.) | No |
| 39258 | 19152.0 | Employed full-time | Bachelor’s degree (BA, BS, B.Eng., etc.) | No |
| 58450 | 17556.0 | Employed full-time | Bachelor’s degree (BA, BS, B.Eng., etc.) | No |
| ... | ... | ... | ... | ... |
| 48436 | 4464.0 | Employed full-time | Secondary school (e.g. American high school, G... | Yes, part-time |
| 10746 | 3996.0 | Employed full-time | Bachelor’s degree (BA, BS, B.Eng., etc.) | No |
| 8149 | 1596.0 | Employed full-time | Bachelor’s degree (BA, BS, B.Eng., etc.) | Yes, full-time |
| 29736 | 1116.0 | Employed full-time | Primary/elementary school | Yes, part-time |
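
`nlargest` is a shortcut for "sort descending and take the first n rows". A sketch on a made-up frame (`toy`, `in_austria` and the values are hypothetical) showing both spellings side by side:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "Country": ["Austria", "Brazil", "Austria", "Austria", "Austria"],
        "ConvertedComp": [30000.0, 90000.0, np.nan, 70000.0, 50000.0],
    },
    index=[201, 202, 203, 204, 205],
)

in_austria = toy["Country"] == "Austria"

# Filter first, then keep the two rows with the highest salary.
top_two = toy[in_austria].nlargest(2, "ConvertedComp")

# The longer spelling: sort descending, then take the first two rows.
top_two_sorted = (
    toy[in_austria]
    .sort_values("ConvertedComp", ascending=False)
    .head(2)
)
```

Neither call modifies `toy`, so the filter mask can be reused for further questions about the same country.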
