# Getting Started

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('data/survey_results_public.csv')

In [3]:
df.shape

(88883, 85)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88883 entries, 0 to 88882
Data columns (total 85 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Respondent              88883 non-null  int64  
 1   MainBranch              88331 non-null  object 
 2   Hobbyist                88883 non-null  object 
 3   OpenSourcer             88883 non-null  object 
 4   OpenSource              86842 non-null  object 
 5   Employment              87181 non-null  object 
 6   Country                 88751 non-null  object 
 7   Student                 87014 non-null  object 
 8   EdLevel                 86390 non-null  object 
 9   UndergradMajor          75614 non-null  object 
 10  EduOther                84260 non-null  object 
 11  OrgSize                 71791 non-null  object 
 12  DevType                 81335 non-null  object 
 13  YearsCode               87938 non-null  object 
 14  Age1stCode              87634 non-null

In [5]:
# pd.set_option('display.max_columns', df.shape[1])  # configure output to show all columns

Next, we load the schema CSV file to find out what each column in the survey results means.

In [6]:
schema = pd.read_csv('data/survey_results_schema.csv')
# pd.set_option('display.max_rows', schema.shape[0])   # will allow us to see all 85 rows of the schema

Note that pandas truncates long output by default, so we can't usually see all 85 rows of the schema. Enabling the commented-out `display.max_rows` setting above lifts that limit.
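
If you would rather not change the display setting for the whole notebook, `pd.option_context` applies it only inside a `with` block. The following is a minimal sketch on a made-up frame; `demo` and its `col0` ... `col99` values are hypothetical stand-ins, not the survey schema.

```python
import pandas as pd

# Hypothetical stand-in frame with more rows than pandas shows by default.
demo = pd.DataFrame({"Column": [f"col{i}" for i in range(100)]})

default_max_rows = pd.get_option("display.max_rows")

# Inside the block the row limit is lifted; None means "no limit".
with pd.option_context("display.max_rows", None):
    inside_max_rows = pd.get_option("display.max_rows")
    full_text = repr(demo)  # all 100 rows are rendered here

# Outside the block the previous setting is back in effect.
restored_max_rows = pd.get_option("display.max_rows")
```

This keeps the rest of the notebook on the default, truncated display.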

You can use `.head()` and `.tail()` to show only the first or last few rows when you don't want the full output.

In [7]:
schema.head(3)

| | Column | QuestionText |
|---|---|---|
| 0 | Respondent | Randomized respondent ID number (not in order ... |
| 1 | MainBranch | Which of the following options best describes ... |
| 2 | Hobbyist | Do you code as a hobby? |


# DataFrame and Series Basics

Accessing a DataFrame with square brackets is a bit like accessing a dictionary of lists. The keys are like column labels, and the lists are like the Series that hold each column's values.

In [8]:
people = {
    "first": ["Corey", "Jane"],
    "last": ["Schafer", "Doe"]
}

people["first"]

['Corey', 'Jane']

In [9]:
pd.DataFrame(people)

| | first | last |
|---|---|---|
| 0 | Corey | Schafer |
| 1 | Jane | Doe |
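
To make the dictionary analogy concrete, here is a small sketch of square-bracket access on the same `people` data. The variable names `people_df`, `first_names` and `both_names` are introduced here for illustration only.

```python
import pandas as pd

people = {
    "first": ["Corey", "Jane"],
    "last": ["Schafer", "Doe"],
}
people_df = pd.DataFrame(people)

# A single label returns one column as a Series.
first_names = people_df["first"]

# A list of labels returns a DataFrame, even if the list has one entry.
both_names = people_df[["first", "last"]]

print(type(first_names).__name__)
print(type(both_names).__name__)
print(first_names.tolist())
```

The double brackets in the second access are a list inside the indexing brackets, not special syntax.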


See the `Snippets.ipynb` notebook for code showing how to access rows and columns.

Let's look at what we can do with the main data set.

In [10]:
df['Hobbyist']

0        Yes
1         No
2        Yes
3         No
4        Yes
        ... 
88878    Yes
88879     No
88880     No
88881     No
88882    Yes
Name: Hobbyist, Length: 88883, dtype: object

In [11]:
df.loc[0, 'Hobbyist']  # first row, Hobbyist column

'Yes'

In [12]:
df.loc[0:2, 'Hobbyist':'Employment']  # label-based slices include both endpoints

| | Hobbyist | OpenSourcer | OpenSource | Employment |
|---|---|---|---|---|
| 0 | Yes | Never | The quality of OSS and closed source software ... | Not employed, and not looking for work |
| 1 | No | Less than once per year | The quality of OSS and closed source software ... | Not employed, but looking for work |
| 2 | Yes | Never | The quality of OSS and closed source software ... | Employed full-time |
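
Note that the slice `0:2` returned three rows. Label-based slices with `.loc` include both endpoints, while position-based slices with `.iloc` follow the usual Python rule and exclude the end. The sketch below shows the difference on a tiny frame; `toy` and its placeholder values are hypothetical, not survey data.

```python
import pandas as pd

# Hypothetical placeholder data with the default 0..3 integer index.
toy = pd.DataFrame({
    "Hobbyist": ["a", "b", "c", "d"],
    "OpenSourcer": ["e", "f", "g", "h"],
    "Employment": ["i", "j", "k", "l"],
})

# .loc slices by label: rows 0, 1 AND 2; columns Hobbyist through Employment.
by_label = toy.loc[0:2, "Hobbyist":"Employment"]

# .iloc slices by position: rows 0 and 1 only; the first two columns only.
by_position = toy.iloc[0:2, 0:2]

print(by_label.shape)
print(by_position.shape)
```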


# Indexes: setting, resetting and using them

In [13]:
df.set_index('Respondent', inplace=True)  # could also use index_col in read_csv
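
As the comment mentions, the index can also be chosen while reading the file. Here is a minimal sketch comparing the two routes. It reads from an in-memory string rather than the survey file, and the three rows in `csv_text` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for the survey file.
csv_text = "Respondent,Hobbyist\n1,Yes\n2,No\n3,Yes\n"

# Route 1: pick the index column at load time.
indexed = pd.read_csv(io.StringIO(csv_text), index_col="Respondent")

# Route 2: load first, then set the index (returns a new frame).
also_indexed = pd.read_csv(io.StringIO(csv_text)).set_index("Respondent")

print(indexed.index.name)
print(indexed.loc[2, "Hobbyist"])  # look up by Respondent label, not position
```

Once `Respondent` is the index, `.loc` looks rows up by respondent ID rather than by the default 0-based row number.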

How do we find the question for a column, given that we have the schema frame?

In [14]:
schema.head(3)

| | Column | QuestionText |
|---|---|---|
| 0 | Respondent | Randomized respondent ID number (not in order ... |
| 1 | MainBranch | Which of the following options best describes ... |
| 2 | Hobbyist | Do you code as a hobby? |


In [15]:
schema.set_index('Column', inplace=True)

In [17]:
schema.loc['MgrIdiot']

QuestionText    How confident are you that your manager knows ...
Name: MgrIdiot, dtype: object
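
The displayed question text is cut off with `...` because selecting a row returns a Series, and its printed form is truncated. Passing the column label to `.loc` as well returns the cell as a plain string. The sketch below uses a made-up stand-in: `toy_schema` and the `LongQuestion` row are hypothetical, and only the `Hobbyist` text comes from the output above.

```python
import pandas as pd

# Hypothetical stand-in for the schema frame, indexed by column name.
toy_schema = pd.DataFrame({
    "Column": ["Hobbyist", "LongQuestion"],
    "QuestionText": [
        "Do you code as a hobby?",
        "A deliberately long made-up question " + "that keeps going " * 5,
    ],
}).set_index("Column")

# Row label only: a Series whose printed form may be truncated.
row = toy_schema.loc["LongQuestion"]

# Row label and column label: the full cell value as a str.
text = toy_schema.loc["LongQuestion", "QuestionText"]

print(text)
```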

In [18]:
schema.sort_index()

| Column | QuestionText |
|---|---|
| Age | What is your age (in years)? If you prefer not... |
| Age1stCode | At what age did you write your first line of c... |
| BetterLife | Do you think people born today will have a bet... |
| BlockchainIs | Blockchain / cryptocurrency technology is prim... |
| BlockchainOrg | How is your organization thinking about or imp... |
| ... | ... |
| WorkPlan | How structured or planned is your work? |
| WorkRemote | How often do you work remotely? |
| WorkWeekHrs | On average, how many hours per week do you work? |
| YearsCode | Including any education, how many years have y... |
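
`sort_index()` returns a sorted copy and leaves the original frame untouched. The sketch below also shows descending order and `reset_index()`, which moves the index back into an ordinary column. `toy_schema` and its `q-...` values are hypothetical placeholders, not the real schema text.

```python
import pandas as pd

# Hypothetical stand-in for the schema frame, indexed by column name.
toy_schema = pd.DataFrame(
    {"QuestionText": ["q-respondent", "q-hobbyist", "q-age"]},
    index=pd.Index(["Respondent", "Hobbyist", "Age"], name="Column"),
)

ascending = toy_schema.sort_index()                  # A to Z, new frame
descending = toy_schema.sort_index(ascending=False)  # Z to A, new frame

# reset_index() turns the "Column" index back into a regular column
# and restores the default 0, 1, 2, ... index.
restored = toy_schema.reset_index()

print(list(ascending.index))
print(list(restored.columns))
```

To keep a sorted or reset result, assign it to a name, or pass `inplace=True` as the `set_index` calls above do.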
