# Outline

- Research Questions:
    - q1
    - q2
    - q3
    
- Research on Data:
    - Preparation
    - Analysis
    - Modeling
    - Visualization

In [1]:
# The code uses comments and/or Notebook Markdown cells effectively.
# The steps of the data science process (gather, assess, clean, analyze, model, visualize)
# are clearly identified with comments or Markdown cells. Variable and function names
# follow the PEP 8 style guide.

# Code is well documented and uses functions and classes as necessary.
# All functions include document strings.
# DRY principles are implemented.
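
The notebook computes percentage shares with `value_counts() * 100 / len(...)` in many cells. As a minimal sketch of the docstring and DRY points above, that pattern could live in one helper. The name `percentage_share` and the sample data are hypothetical, not part of the original notebook.

```python
import pandas as pd


def percentage_share(series, dropna=False):
    """Return the share of each distinct value in `series` as a percentage.

    Parameters
    ----------
    series : pandas.Series
        Categorical answers, e.g. one survey column.
    dropna : bool, default False
        If False, missing answers are counted as their own category.

    Returns
    -------
    pandas.Series
        Percentages indexed by value, sorted in descending order.
    """
    return series.value_counts(normalize=True, dropna=dropna) * 100


# Tiny made-up sample standing in for a survey column.
sample = pd.Series(['dev', 'dev', 'student', 'dev'])
shares = percentage_share(sample)
print(shares)
```

With `dropna=False` the denominator is the full length of the column, so the non-missing categories get the same percentages as the `* 100 / len(...)` form.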

### Step 1: Gather Data

In [3]:
%cd airbnb_munich_2019_data
!ls

C:\Users\phill\Desktop\prog_work\udacity_projects\dsend_project_1\airbnb_munich_2019_data
calendar.csv.gz
listings.csv
listings.csv.gz
neighbourhoods.csv
neighbourhoods.geojson
reviews.csv
reviews.csv.gz


In [3]:
import pandas as pd
pd.__version__

'0.25.3'

In [7]:
df_schema = pd.read_csv('survey_results_schema.csv')
df = pd.read_csv('survey_results_public.csv')

In [8]:
df_schema

Unnamed: 0,Column,QuestionText
0,Respondent,Randomized respondent ID number (not in order ...
1,MainBranch,Which of the following options best describes ...
2,Hobbyist,Do you code as a hobby?
3,OpenSourcer,How often do you contribute to open source?
4,OpenSource,How do you feel about the quality of open sour...
...,...,...
80,Sexuality,Which of the following do you currently identi...
81,Ethnicity,Which of the following do you identify as? Ple...
82,Dependents,"Do you have any dependents (e.g., children, el..."
83,SurveyLength,How do you feel about the length of the survey...


In [9]:
df

Unnamed: 0,Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,...,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
0,1,I am a student who is learning to code,Yes,Never,The quality of OSS and closed source software ...,"Not employed, and not looking for work",United Kingdom,No,Primary/elementary school,,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,14.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
1,2,I am a student who is learning to code,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work",Bosnia and Herzegovina,"Yes, full-time","Secondary school (e.g. American high school, G...",,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,19.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
2,3,"I am not primarily a developer, but I write co...",Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Thailand,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Web development or web design,...,Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,28.0,Man,No,Straight / Heterosexual,,Yes,Appropriate in length,Neither easy nor difficult
3,4,I am a developer by profession,No,Never,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,22.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
4,5,I am a developer by profession,Yes,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,Ukraine,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",...,Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,30.0,Man,No,Straight / Heterosexual,White or of European descent;Multiracial,No,Appropriate in length,Easy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88878,88377,,Yes,Less than once a month but more than once per ...,The quality of OSS and closed source software ...,"Not employed, and not looking for work",Canada,No,Primary/elementary school,,...,,Tech articles written by other developers;Tech...,,Man,No,,,No,Appropriate in length,Easy
88879,88601,,No,Never,The quality of OSS and closed source software ...,,,,,,...,,,,,,,,,,
88880,88802,,No,Never,,Employed full-time,,,,,...,,,,,,,,,,
88881,88816,,No,Never,"OSS is, on average, of HIGHER quality than pro...","Independent contractor, freelancer, or self-em...",,,,,...,,,,,,,,,,


In [14]:
# Non-null answers per column, grouped by the respondent's main branch
branch_counts = df.groupby('MainBranch').count()
branch_counts

Unnamed: 0_level_0,Respondent,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,EduOther,...,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
MainBranch,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
I am a developer by profession,65679,65679,65679,64500,65354,65679,64775,64426,60109,62852,...,63663,50720,59667,63460,62312,57315,57371,62003,64473,64539
I am a student who is learning to code,10189,10189,10189,9767,9415,10189,9805,9648,5501,9286,...,9749,8849,8585,9685,9342,8130,8296,9261,9923,9947
"I am not primarily a developer, but I write code sometimes as part of my work",7539,7539,7539,7377,7466,7539,7415,7370,6714,7240,...,7328,5782,6662,7257,7113,6481,6620,7022,7394,7393
I code primarily as a hobby,3340,3340,3340,3167,2936,3340,3103,3033,1596,3055,...,3194,2727,2701,3125,3034,2652,2746,2977,3254,3258
"I used to be a developer by profession, but no longer am",1584,1584,1584,1540,1550,1584,1542,1537,1420,1500,...,1537,1182,1317,1509,1459,1285,1332,1453,1550,1557


In [30]:
df.MainBranch = df.MainBranch.fillna('No response')

In [31]:
df.MainBranch.value_counts() * 100 / len(df.MainBranch)

I am a developer by profession                                                   73.893770
I am a student who is learning to code                                           11.463384
I am not primarily a developer, but I write code sometimes as part of my work     8.481937
I code primarily as a hobby                                                       3.757749
I used to be a developer by profession, but no longer am                          1.782118
No response                                                                       0.621041
Name: MainBranch, dtype: float64

In [34]:
df_me = df[df.MainBranch == 'I am not primarily a developer, but I write code sometimes as part of my work']
df_me.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7539 entries, 2 to 88327
Data columns (total 85 columns):
Respondent                7539 non-null int64
MainBranch                7539 non-null object
Hobbyist                  7539 non-null object
OpenSourcer               7539 non-null object
OpenSource                7377 non-null object
Employment                7466 non-null object
Country                   7539 non-null object
Student                   7415 non-null object
EdLevel                   7370 non-null object
UndergradMajor            6714 non-null object
EduOther                  7240 non-null object
OrgSize                   6847 non-null object
DevType                   6869 non-null object
YearsCode                 7455 non-null object
Age1stCode                7434 non-null object
YearsCodePro              7349 non-null object
CareerSat                 7438 non-null object
JobSat                    7136 non-null object
MgrIdiot                  5622 non-null object


In [40]:
background_1 = df_schema[df_schema.QuestionText == 'What was your main or most important field of study?']
background_1

Unnamed: 0,Column,QuestionText
9,UndergradMajor,What was your main or most important field of ...


In [42]:
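# This exact-match filter does not find row 10: the schema spells the question "non-degree", not "nondegree".
background_2 = df_schema[df_schema.QuestionText == 'Which of the following types of nondegree education have you used or participated in?']
background_2

# Look the question up by column name instead.
df_schema.Column[10]
df_schema[df_schema.Column == 'EduOther']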

Unnamed: 0,Column,QuestionText
10,EduOther,Which of the following types of non-degree edu...


In [45]:
len(df_me)

7539

In [50]:
df_me.UndergradMajor.value_counts() * 100 / len(df_me)

Computer science, computer engineering, or software engineering          31.688553
Another engineering discipline (ex. civil, electrical, mechanical)       13.927577
No response                                                              10.943096
A natural science (ex. biology, chemistry, physics)                      10.133970
Information systems, information technology, or system administration     8.926913
Mathematics or statistics                                                 6.685237
A business discipline (ex. accounting, finance, marketing)                4.761905
A social science (ex. anthropology, psychology, political science)        3.634434
A humanities discipline (ex. literature, history, philosophy)             3.170182
Fine arts or performing arts (ex. graphic design, music, studio art)      2.215148
Web development or web design                                             1.883539
I never declared a major                                                  1.074413
A he

In [53]:
df_me.EduOther.value_counts() * 100 / len(df_me)

Taken an online course in programming or software development (e.g. a MOOC);Taught yourself a new language, framework, or tool without taking a formal course                                                                                                                                                                                                                                                                                9.908476
Taught yourself a new language, framework, or tool without taking a formal course                                                                                                                                                                                                                                                                                                                                                            9.377902
Taught yourself a new language, framework, or tool without taking a formal course;Contributed to open source software       

In [52]:
df.UndergradMajor.value_counts() * 100 / len(df)

Computer science, computer engineering, or software engineering          53.119269
Another engineering discipline (ex. civil, electrical, mechanical)        7.000214
Information systems, information technology, or system administration     5.910017
Web development or web design                                             3.850005
A natural science (ex. biology, chemistry, physics)                       3.636241
Mathematics or statistics                                                 3.347097
A business discipline (ex. accounting, finance, marketing)                2.071262
A humanities discipline (ex. literature, history, philosophy)             1.767492
A social science (ex. anthropology, psychology, political science)        1.521101
Fine arts or performing arts (ex. graphic design, music, studio art)      1.387217
I never declared a major                                                  1.098073
A health science (ex. nursing, pharmacy, radiology)                       0.363399
Name

In [54]:
df_really_me = df_me[df_me.UndergradMajor == 'A business discipline (ex. accounting, finance, marketing)']

In [59]:
business_guys = df[df.UndergradMajor == 'A business discipline (ex. accounting, finance, marketing)']

In [70]:
cs_guys = df[df.UndergradMajor == 'Computer science, computer engineering, or software engineering']

In [72]:
math_guys = df[df.UndergradMajor == 'Mathematics or statistics']

In [74]:
art_guys = df[df.UndergradMajor == 'Fine arts or performing arts (ex. graphic design, music, studio art)']

In [77]:
culture_guys = df[df.UndergradMajor == 'A humanities discipline (ex. literature, history, philosophy)']

In [79]:
social_guys = df[df.UndergradMajor == 'A social science (ex. anthropology, psychology, political science)']

In [63]:
business_guys.MainBranch.value_counts() * 100 / len(business_guys)

I am a developer by profession                                                   62.031505
I am not primarily a developer, but I write code sometimes as part of my work    19.500272
I am a student who is learning to code                                            8.853884
I code primarily as a hobby                                                       6.735470
I used to be a developer by profession, but no longer am                          2.227051
No response                                                                       0.651820
Name: MainBranch, dtype: float64

In [71]:
cs_guys.MainBranch.value_counts() * 100 / len(cs_guys)

I am a developer by profession                                                   84.479180
I am a student who is learning to code                                            7.470242
I am not primarily a developer, but I write code sometimes as part of my work     5.059940
I used to be a developer by profession, but no longer am                          1.743127
I code primarily as a hobby                                                       1.035710
No response                                                                       0.211802
Name: MainBranch, dtype: float64

In [73]:
math_guys.MainBranch.value_counts() * 100 / len(math_guys)

I am a developer by profession                                                   70.689076
I am not primarily a developer, but I write code sometimes as part of my work    16.941176
I am a student who is learning to code                                            5.478992
I code primarily as a hobby                                                       3.495798
I used to be a developer by profession, but no longer am                          2.722689
No response                                                                       0.672269
Name: MainBranch, dtype: float64

In [75]:
art_guys.MainBranch.value_counts() * 100 / len(art_guys)

I am a developer by profession                                                   74.939173
I am not primarily a developer, but I write code sometimes as part of my work    13.544201
I am a student who is learning to code                                            5.190592
I code primarily as a hobby                                                       4.136253
I used to be a developer by profession, but no longer am                          1.622060
No response                                                                       0.567721
Name: MainBranch, dtype: float64

In [78]:
culture_guys.MainBranch.value_counts() * 100 / len(culture_guys)

I am a developer by profession                                                   72.756206
I am not primarily a developer, but I write code sometimes as part of my work    15.213240
I am a student who is learning to code                                            5.346913
I code primarily as a hobby                                                       4.264799
I used to be a developer by profession, but no longer am                          1.527689
No response                                                                       0.891152
Name: MainBranch, dtype: float64

In [80]:
social_guys.MainBranch.value_counts() * 100 / len(social_guys)

I am a developer by profession                                                   65.310651
I am not primarily a developer, but I write code sometimes as part of my work    20.266272
I am a student who is learning to code                                            7.026627
I code primarily as a hobby                                                       4.437870
I used to be a developer by profession, but no longer am                          2.071006
No response                                                                       0.887574
Name: MainBranch, dtype: float64

In [None]:
# For each educational background, what percentage of respondents work as developers by profession?
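
One way to answer this question for all majors at once, instead of one filtered frame per major, is a row-normalised cross-tabulation. The sketch below uses a made-up miniature frame with shortened labels; on the real data the two columns would be `df.UndergradMajor` and `df.MainBranch`.

```python
import pandas as pd

# Made-up miniature of the survey frame; only the two columns used here.
toy = pd.DataFrame({
    'UndergradMajor': ['CS', 'CS', 'CS', 'CS', 'Business', 'Business'],
    'MainBranch': ['developer', 'developer', 'developer', 'student',
                   'developer', 'hobby'],
})

# Row-normalised cross-tabulation: each row (major) sums to 100 %.
branch_by_major = pd.crosstab(
    toy['UndergradMajor'], toy['MainBranch'], normalize='index'
) * 100

# Share of professional developers per major, highest first.
developer_share = branch_by_major['developer'].sort_values(ascending=False)
print(developer_share)
```

Note that `pd.crosstab` drops rows where either column is missing, so the percentages are relative to respondents who answered both questions.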

In [84]:
# what do the business people do?

In [103]:
business_guys['CompTotal'].value_counts()  # 'CompTotal' is df_schema.Column[29]

80000.0     37
90000.0     28
120000.0    26
60000.0     26
100000.0    25
            ..
340000.0     1
66000.0      1
580000.0     1
12300.0      1
101500.0     1
Name: CompTotal, Length: 357, dtype: int64

# Airbnb data

- Which area of Munich is the most expensive for an overnight stay?
- Does a good rating influence whether a room gets booked? Do more ratings help? Why have some rooms not been booked?
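
For the first question, a plausible starting point is to aggregate the listing price per neighbourhood. The sketch below assumes the `neighbourhood` and `price` columns of `listings.csv` and runs on made-up rows. The median is used for ranking because a few very expensive listings can distort the mean.

```python
import pandas as pd

# Made-up rows using the column names of listings.csv (neighbourhood, price).
toy_listings = pd.DataFrame({
    'neighbourhood': ['Altstadt-Lehel', 'Altstadt-Lehel', 'Laim', 'Laim', 'Laim'],
    'price': [150, 250, 40, 60, 500],
})

# Median, mean and number of listings per area, most expensive median first.
price_by_area = (
    toy_listings.groupby('neighbourhood')['price']
    .agg(['median', 'mean', 'count'])
    .sort_values('median', ascending=False)
)
print(price_by_area)
```

In the toy data both areas have the same mean, but the single outlier in one of them does not affect the median ranking.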

In [6]:
import pandas as pd
# Shapely
# geopandas
# gdal

In [10]:
%cd airbnb_munich_2019_data/
!ls

C:\Users\phill\Desktop\prog_work\udacity_projects\dsend_project_1\airbnb_munich_2019_data
calendar.csv.gz
listings.csv
listings.csv.gz
neighbourhoods.csv
neighbourhoods.geojson
reviews.csv
reviews.csv.gz


In [11]:
df_calendar = pd.read_csv('calendar.csv.gz', compression='gzip')
df_calendar.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4190565 entries, 0 to 4190564
Data columns (total 7 columns):
listing_id        int64
date              object
available         object
price             object
adjusted_price    object
minimum_nights    int64
maximum_nights    int64
dtypes: int64(3), object(4)
memory usage: 223.8+ MB


In [12]:
df_calendar

Unnamed: 0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights
0,216529,2019-11-25,f,$150.00,$150.00,5,60
1,159634,2019-11-25,f,$53.00,$53.00,14,90
2,159634,2019-11-26,f,$53.00,$53.00,14,90
3,159634,2019-11-27,f,$53.00,$53.00,14,90
4,159634,2019-11-28,f,$53.00,$53.00,14,90
...,...,...,...,...,...,...,...
4190560,40287455,2020-11-19,f,$43.00,$43.00,26,1125
4190561,40287455,2020-11-20,f,$43.00,$43.00,26,1125
4190562,40287455,2020-11-21,f,$43.00,$43.00,26,1125
4190563,40287455,2020-11-22,f,$43.00,$43.00,26,1125


In [98]:
len(df_calendar.listing_id.unique())

# validated: df_calendar has time series data on the listings from df_listings 

11481

In [99]:
11481 * 365

4190565
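
The product matching the row count suggests one calendar row per listing per day, but it does not prove it: one listing with too many rows could offset another with too few. A more direct check counts the rows per listing. The helper name `listings_with_unexpected_days` and the toy data are hypothetical.

```python
import pandas as pd

# Made-up calendar: listing 1 has 3 days, listing 2 only 2.
toy_calendar = pd.DataFrame({
    'listing_id': [1, 1, 1, 2, 2],
    'date': ['2019-11-25', '2019-11-26', '2019-11-27',
             '2019-11-25', '2019-11-26'],
})


def listings_with_unexpected_days(calendar, expected_days):
    """Return the row count of every listing that does not have `expected_days` rows."""
    days_per_listing = calendar.groupby('listing_id').size()
    return days_per_listing[days_per_listing != expected_days]


odd = listings_with_unexpected_days(toy_calendar, expected_days=3)
print(odd)
```

On the real data the call would be `listings_with_unexpected_days(df_calendar, 365)`; an empty result would confirm the assumption.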

In [94]:
# Convert the price strings (e.g. "$150.00") to floats
df_calendar_price_num = df_calendar['price'].str.strip('$')
# str.strip only removes leading/trailing characters, so thousands separators need str.replace
df_calendar_price_num = df_calendar_price_num.str.replace(',', '')
df_calendar_price_num = df_calendar_price_num.astype('float')

In [96]:
df_calendar_price_num.describe()

count    4.190394e+06
mean     1.138422e+02
std      1.575467e+02
min      9.000000e+00
25%      5.500000e+01
50%      8.200000e+01
75%      1.300000e+02
max      9.039000e+03
Name: price, dtype: float64
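
Since `adjusted_price` has the same string format as `price`, the conversion could be wrapped in a small function and reused. This is a sketch under that assumption; `parse_price` is a hypothetical name and the sample values are made up.

```python
import pandas as pd


def parse_price(prices):
    """Convert price strings like '$1,234.50' to floats; missing values stay NaN."""
    cleaned = (
        prices.str.replace('$', '', regex=False)   # currency symbol
        .str.replace(',', '', regex=False)         # thousands separator
    )
    return pd.to_numeric(cleaned)


# Made-up sample including a thousands separator and a missing value.
parsed = parse_price(pd.Series(['$150.00', '$1,234.50', None]))
print(parsed)
```

Missing values matter here: the `count` reported by `describe()` is lower than the 4190565 rows of `df_calendar`, so some prices are absent.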

In [30]:
df_listings = pd.read_csv('listings.csv')
df_listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11481 entries, 0 to 11480
Data columns (total 16 columns):
id                                11481 non-null int64
name                              11459 non-null object
host_id                           11481 non-null int64
host_name                         11458 non-null object
neighbourhood_group               0 non-null float64
neighbourhood                     11481 non-null object
latitude                          11481 non-null float64
longitude                         11481 non-null float64
room_type                         11481 non-null object
price                             11481 non-null int64
minimum_nights                    11481 non-null int64
number_of_reviews                 11481 non-null int64
last_review                       8927 non-null object
reviews_per_month                 8927 non-null float64
calculated_host_listings_count    11481 non-null int64
availability_365                  11481 non-null int64
dtyp

In [31]:
df_listings

# validated latitude & longitude data with a GPS coordinates finder (https://www.gps-coordinates.net/)

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,36720,"Beautiful 2 rooms flat, Glockenbach",158413,Gabriela,,Ludwigsvorstadt-Isarvorstadt,48.13057,11.56929,Entire home/apt,95,2,25,2017-07-22,0.37,1,0
1,97945,Deluxw-Apartm. with roof terrace,517685,Angelika,,Hadern,48.11476,11.48782,Entire home/apt,80,2,131,2019-10-03,1.32,1,84
2,114695,Apartment Munich/East with sundeck,581737,,,Berg am Laim,48.11923,11.63726,Entire home/apt,95,1,53,2019-10-06,0.52,2,163
3,127383,City apartment next to Pinakothek,630556,Sonja,,Maxvorstadt,48.15198,11.56486,Entire home/apt,120,3,82,2019-07-21,0.79,2,0
4,157808,"Near Olympia,English Garden",759734,Christian,,Schwabing-West,48.16381,11.56089,Private room,35,1,0,,,1,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11476,40342200,Charmante Altbauwohnung im Herzen von München,84755521,Kathi,,Neuhausen-Nymphenburg,48.14833,11.54345,Entire home/apt,80,1,0,,,1,34
11477,40343877,Central apartment near the Bavaria,276730632,Tymothé,,Sendling,48.12343,11.53969,Private room,180,6,0,,,4,23
11478,40344151,Room in Laim,35733965,Nauman,,Laim,48.13829,11.51235,Private room,30,1,0,,,1,33
11479,40347084,Neuhausen mit persönlichem Ambiente !,13214071,Katja,,Neuhausen-Nymphenburg,48.14794,11.53698,Entire home/apt,40,30,0,,,1,65


In [39]:
df_listings.availability_365[df_listings.availability_365 == 365]

65       365
70       365
86       365
120      365
122      365
        ... 
11265    365
11315    365
11347    365
11397    365
11463    365
Name: availability_365, Length: 192, dtype: int64

In [16]:
df_neighbourhoods = pd.read_csv('neighbourhoods.csv')
df_neighbourhoods.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 2 columns):
neighbourhood_group    0 non-null float64
neighbourhood          25 non-null object
dtypes: float64(1), object(1)
memory usage: 528.0+ bytes


In [26]:
df_neighbourhoods

Unnamed: 0,neighbourhood_group,neighbourhood
0,,Allach-Untermenzing
1,,Altstadt-Lehel
2,,Aubing-Lochhausen-Langwied
3,,Au-Haidhausen
4,,Berg am Laim
5,,Bogenhausen
6,,Feldmoching-Hasenbergl
7,,Hadern
8,,Laim
9,,Ludwigsvorstadt-Isarvorstadt


In [17]:
df_neighbourhoods_geo = pd.read_json('neighbourhoods.geojson')
df_neighbourhoods_geo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 2 columns):
type        25 non-null object
features    25 non-null object
dtypes: object(2)
memory usage: 528.0+ bytes
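
`pd.read_json` leaves each GeoJSON feature as a nested dict in the `features` column, which is awkward to query. A possible alternative, sketched below with the standard `json` module, flattens every feature into one row. The function name `features_to_frame` is hypothetical, and the `neighbourhood` property name in the toy data is an assumption about the file, not something the output above confirms.

```python
import json

import pandas as pd


def features_to_frame(geojson):
    """Flatten a GeoJSON FeatureCollection dict into one row per feature."""
    rows = []
    for feature in geojson['features']:
        # Copy the feature's properties (may be missing or null).
        row = dict(feature.get('properties') or {})
        geometry = feature.get('geometry') or {}
        row['geometry_type'] = geometry.get('type')
        row['coordinates'] = geometry.get('coordinates')
        rows.append(row)
    return pd.DataFrame(rows)


# Made-up FeatureCollection; the 'neighbourhood' property name is an assumption.
toy_geojson = json.loads('''
{"type": "FeatureCollection",
 "features": [
  {"type": "Feature",
   "properties": {"neighbourhood": "Example-District"},
   "geometry": {"type": "MultiPolygon",
                "coordinates": [[[[11.55, 48.14], [11.56, 48.14],
                                  [11.56, 48.15], [11.55, 48.14]]]]}}
 ]}
''')

areas = features_to_frame(toy_geojson)
print(areas[['neighbourhood', 'geometry_type']])
```

For the real file, the dict would come from `json.load` on an opened `neighbourhoods.geojson`. Libraries such as geopandas read GeoJSON directly, but they add the extra dependencies listed in the import cell.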


In [28]:
df_neighbourhoods_geo['features'].iloc[1]

{'type': 'Feature',
 'geometry': {'type': 'MultiPolygon',
  'coordinates': [[[[11.556000000000001, 48.1408],
     [11.5593, 48.1406],
     [11.561, 48.1411],
     [11.5611, 48.1412],
     [11.561, 48.1405],
     [11.561, 48.1402],
     [11.5643, 48.1395],
     [11.5655, 48.1395],
     [11.5652, 48.1387],
     [11.5651, 48.1381],
     [11.5649, 48.1371],
     [11.5649, 48.1361],
     [11.5656, 48.1351],
     [11.5667, 48.1335],
     [11.5668, 48.1333],
     [11.5676, 48.132],
     [11.5685, 48.1312],
     [11.5706, 48.1308],
     [11.5716, 48.1314],
     [11.573, 48.1321],
     [11.5749, 48.1333],
     [11.5765, 48.1332],
     [11.5791, 48.1338],
     [11.58, 48.1339],
     [11.582, 48.1337],
     [11.5828, 48.1339],
     [11.5836, 48.133],
     [11.5851, 48.1323],
     [11.586, 48.1319],
     [11.587, 48.1316],
     [11.5867, 48.1312],
     [11.5857, 48.1303],
     [11.5848, 48.1295],
     [11.5837, 48.1286],
     [11.5826, 48.1284],
     [11.5813, 48.1284],
     [11.5796, 48.1279],
  

In [101]:
df_reviews = pd.read_csv('reviews.csv.gz', compression='gzip')
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 175562 entries, 0 to 175561
Data columns (total 6 columns):
listing_id       175562 non-null int64
id               175562 non-null int64
date             175562 non-null object
reviewer_id      175562 non-null int64
reviewer_name    175562 non-null object
comments         175488 non-null object
dtypes: int64(3), object(3)
memory usage: 8.0+ MB


In [21]:
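# Not used: reviews.csv has only 2 columns (listing_id, date); reviews.csv.gz has all 6.
# Loaded under a separate name so it does not overwrite df_reviews.

df_reviews_summary = pd.read_csv('reviews.csv')
df_reviews_summary.info()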

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 175562 entries, 0 to 175561
Data columns (total 2 columns):
listing_id    175562 non-null int64
date          175562 non-null object
dtypes: int64(1), object(1)
memory usage: 2.7+ MB


In [24]:
df_reviews

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,36720,12723661,2014-05-09,11840468,Mikhail,I would like to thank Gabriella as her apartme...
1,36720,13147830,2014-05-20,5466213,Kim,Gabriela's place was absolutely fantastic. It...
2,36720,16302574,2014-07-25,2062882,Juan R.,"Quiet place, open to a courtyard, with all the..."
3,36720,16428874,2014-07-27,1225618,David,The best Airbnb expeierence I've had. The apar...
4,36720,19478358,2014-09-13,13977301,Cal,"All first rate ! Beautiful apartment, comforta..."
...,...,...,...,...,...,...
175557,40083597,565231820,2019-11-17,182269688,Alfredo,"Such a great place to stay, a subway station r..."
175558,40101052,566367508,2019-11-19,227603695,Leonie,Wir haben uns super wohl gefühlt in dem gemütl...
175559,40156948,567272148,2019-11-22,228192763,Eva,Nino was a great host and the apartment is fan...
175560,40216780,566990264,2019-11-21,12741283,Matteo,The host canceled this reservation 22 days bef...


In [105]:
len(df_reviews.listing_id.unique()) # listings with at least one review; were the remaining ones never booked?

8927
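
The 8927 reviewed listings match the non-null count of `last_review` in `df_listings`, which leaves 11481 - 8927 = 2554 listings without any review. A missing review is not proof that a listing was never booked, since guests can stay without reviewing, so this is only a proxy. The sketch below shows, on made-up frames, how the unreviewed listings could be selected and cross-checked against `number_of_reviews`.

```python
import pandas as pd

# Made-up frames using the column names of listings.csv and reviews.csv.gz.
toy_listings = pd.DataFrame({'id': [10, 20, 30],
                             'number_of_reviews': [2, 0, 1]})
toy_reviews = pd.DataFrame({'listing_id': [10, 10, 30]})

# True for listings that appear at least once in the reviews table.
has_review = toy_listings['id'].isin(toy_reviews['listing_id'])
unreviewed = toy_listings[~has_review]
print(unreviewed)

# Cross-check: both sources should agree on which listings have reviews.
consistent = ((toy_listings['number_of_reviews'] > 0) == has_review).all()
print(consistent)
```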

### Step 2: Assess Data

### Step 3: Clean Data

### Step 4: Analyze Data

### Step 5: Model Data

### Step 6: Visualize Data