![airbnb](https://kaggle2.blob.core.windows.net/competitions/kaggle/4651/logos/front_page.png)

# Airbnb New User Bookings
### Exploratory Data Analysis, part 1

In this notebook, we'll do some data quality checking and data preprocessing using a dataset from the
**Airbnb New User Bookings** [Kaggle competition](https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings).

## 1. Get the data

The dataset for this assignment is part of a [Kaggle recruiting competition](https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings), and comprises [five data files](https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings/data), as follows:

    age_gender_bkts.csv - users' age group, gender, country of destination
    countries.csv       - destination countries in this dataset and their locations
    sessions.csv        - log of user web sessions
    test_users.csv      - test set of users 
    train_users.csv     - training set of users

## 1. Business Understanding

### Determine Business Objectives

Airbnb is a private company that operates an online travel website where people can list, find, and book rental lodgings. According to [Wikipedia](https://en.wikipedia.org/wiki/Airbnb), the site currently has over 1.5 million listings in 190 countries.
The business objective of this data mining activity is to enable Airbnb to:
* Personalize content to its customers,
* Reduce the time it takes for a new customer to book their first Airbnb experience, and
* Better forecast demand.

According to Airbnb's [Terms of Service](https://www.airbnb.com/terms), the site is "intended solely for persons who are 18 or older."

### Determine Data Mining Goals

The goal of the data mining effort is to **predict the country in which a new user will book their first Airbnb experience,** at a 95% confidence level.

Because the Kaggle competition will close prior to the end of the semester, it will not be possible to submit an entry for consideration. However, we will produce a solution according to the posted requirements and, if possible, submit it for scoring. The [submission file](https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings/details/evaluation) must be a CSV file with a header row and two columns (id and country). A perfect model will return a score of 1.0, so an effective prediction algorithm will return a score that is high on a scale of 0 to 1.

## 2. Data Understanding

### Codebook
This section summarizes and describes the meaning and type of data (scale, values, etc.) for every attribute in each of the five original data files. The descriptions also include some basic statistics, such as the range of values, scale, minimums, maximums, and so forth, as appropriate.
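A codebook like the ones below can be partly generated rather than typed by hand. The following is a minimal sketch, not the method used to build these tables: `summarize_columns` is a hypothetical helper, and the toy frame stands in for a real data file.

```python
import pandas as pd

def summarize_columns(df):
    """Return one row per column: dtype, null count, unique count, min, max."""
    rows = []
    for name in df.columns:
        col = df[name]
        is_numeric = pd.api.types.is_numeric_dtype(col)
        rows.append({
            "variable": name,
            "dtype": str(col.dtype),
            "nulls": int(col.isnull().sum()),
            "unique": int(col.nunique()),
            # min/max are only reported for numeric columns
            "min": col.min() if is_numeric else None,
            "max": col.max() if is_numeric else None,
        })
    return pd.DataFrame(rows)

# Toy frame standing in for one of the data files (values are made up)
toy = pd.DataFrame({"gender": ["male", "female", "male"],
                    "population_in_thousands": [1.0, 9.0, 47.0]})
summary = summarize_columns(toy)
print(summary)
```

Running the helper on each loaded DataFrame gives a quick cross-check of the counts and ranges recorded in the tables.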

#### age_gender_bkts.csv (1 of 5)
This file contains demographic data about 10 destination countries. The fields include country identifier, age group, gender, and population for the combination of age group and gender. The year when the data were collected is also included.
There are 420 records. The table below describes the variables in this data file:

Pos.|Variable|Type|Description|Range
---:|---|:---:|---|---
1|age_bucket|String|Used as a categorical variable to designate 5-year age groups (e.g., 25–29 years old)|Every 5-year age range from "0–4" to "95–99", plus a "100+" category (21 unique values)
2|country_destination|String|Two-letter ISO-3166-1 alpha-2 country code|"AU" (Australia), "CA" (Canada), "DE" (Germany), "ES" (Spain), "FR" (France), "GB" (United Kingdom), "IT" (Italy), "NL" (Netherlands), "PT" (Portugal), and "US" (United States) (10 unique values)
3|gender|String|Gender of user|"male" and "female" (2 unique values)
4|population_in_thousands|Numeric (integer)|Population, in thousands, of people in the destination country of a specific gender and age range. (Although these values are represented in the data file as floating point numbers, none have a non-zero decimal component.)|Min: 0<br>Max: 11,601
5|year|Numeric (integer)|Year in which the demographic data was observed (assumed; no information provided)|2015 (all records same)


#### countries.csv (2 of 5)
This data file contains additional descriptive information about the destination countries, identified by two-character codes. For each of 10 destination countries, the data file also includes the approximate location of the country (latitude and longitude), its distance from the United States, the country's geographic size, the primary language spoken there, and a measure of how different that language is from English. There are 10 records in the data file. The following table lists the variables in this data file:

Pos.|Variable|Type|Description|Range/Values
---:|---|:---:|---|---
1|country_destination|String|Two-letter ISO-3166-1 alpha-2 country code|Same as in age_gender_bkts.csv
2|lat_destination|Numeric (signed float)|Location of approximate geographic center of the destination country (degrees latitude; negative values indicate south latitudes)|Range: -90 to 90<br>Min: -26.85<br>Max: 62.39
3|lng_destination|Numeric (signed float)|Location of approximate geographic center of the destination country (degrees longitude; negative values indicate west longitudes)|Range: -180 to 180<br>Min: -96.82<br>Max: 133.28
4|distance_km|Numeric (float)|Great circle distance (in kilometers) between the United States and the destination country|Range: 0 to 20,020<br>Min: 0.0<br>Max: 15,297.7<br>Mean: 7,181.9<br>Median: 7,603.6
5|destination_km2|Numeric (integer)|Total area (in square kilometers) of the destination country|Range: 0 to 510,100,000<br>Min: 41,543<br>Max: 9,826,675<br>Mean: 2,973,734<br>Median: 431,196
6|destination_language|String|Primary language spoken in the destination country|"eng" (English), "deu" (German), "spa" (Spanish), "fra" (French), "ita" (Italian), "nld" (Dutch), "por" (Portuguese) (7 unique values)
7|language_levenshtein_distance|Numeric (float)|The Levenshtein Distance measures the amount of difference between two sequences, usually of letters or words. In this dataset, it appears to represent the amount of difference between the primary language of the destination country and English. However, it is not clear what the values mean (scale or units) or what constitutes a "substantial" difference.|Min: 0.0<br>Max: 95.45<br>Mean: 50.50<br>Median: 67.92
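For reference, here is a minimal sketch of the classic Levenshtein distance between two strings. How the values in `language_levenshtein_distance` were actually computed (which texts were compared, and how the result was scaled) is not documented, so this only illustrates the underlying idea.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

Identical strings have distance 0, which is consistent with the 0.0 minimum in the table (English compared with itself).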

#### train_users.csv (3 of 5)
This data file contains information about the users. All users are from the United States.

Pos.|Variable|Type|Description|Range/Values
---:|---|:---:|---|---
1|id|String|Alphanumeric string representing a unique user|213,451 unique values (each record in data file is unique)
2|date_account_created|Date|Date that the user created their Airbnb account|First: 01/01/2010<br>Last: 06/30/2014<br>(1,634 unique dates)
3|timestamp_first_active|Date/time|Ostensibly, the date and time of the user's first activity on the site. See "Data Quality" note 2.|First: 01/01/1970 05:34:50<br>Last: 01/01/1970 05:35:40
4|date_first_booking|Date|Date of the user's first booking. Some records do not contain a value (NaN), which indicates that the user did not book a trip.|First: 01/02/2010<br>Last: 06/29/2015<br>(88,908 records with values)
5|gender|String|The user's gender|"-unknown-", "MALE", "FEMALE", "OTHER" (4 unique values)
6|age|Numeric (integer)|The user's age. See "Data Quality" note 3.|Min: 1<br>Max: 2,014<br>Null values: 87,990
7|signup_method|String|The method used by the user to sign up for their Airbnb account.|"basic", "facebook", or "google" (3 unique values)
8|signup_flow|Numeric (integer)|Used as a categorical variable to identify the referring page at the time the user signed up|Some (but not all) integers in the range 0 to 25 (17 unique values)
9|language|String|The user's "primary" language, as two-character (ISO-639-1) codes. See "Data Quality" note 4.|'en' (English), 'fr' (French), 'de' (German), 'es' (Spanish), 'it' (Italian), 'pt' (Portuguese), 'zh' (Chinese), 'ko' (Korean), 'ja' (Japanese), 'ru' (Russian), 'pl' (Polish), 'el' (Greek), 'sv' (Swedish), 'nl' (Dutch), 'hu' (Hungarian), 'da' (Danish), 'id' (Indonesian), 'fi' (Finnish), 'no' (Norwegian), 'tr' (Turkish), 'th' (Thai), 'cs' (Czech), 'hr' (Croatian), 'ca' (Catalan), or 'is' (Icelandic) (25 unique values)
10|affiliate_channel|String|Marketing channel from which the user was directed when they signed up|'direct', 'seo', 'other', 'sem-non-brand', 'content', 'sem-brand', 'remarketing', or 'api' (8 unique values)
11|affiliate_provider|String|Marketer that directed user to Airbnb|'direct', 'google', 'other', 'craigslist', 'facebook', 'vast', 'bing', 'meetup', 'facebook-open-graph', 'email-marketing', 'yahoo', 'padmapper', 'gsp', 'wayn', 'naver', 'baidu', 'yandex', or 'daum' (18 unique values)
12|first_affiliate_tracked|String|First marketing the user interacted with before signing up|'untracked', 'omg', nan, 'linked', 'tracked-other', 'product', 'marketing', or 'local ops' (8 unique values)
13|signup_app|String|The app platform that the user used to create their account|'Web', 'Moweb', 'iOS', or 'Android'
14|first_device_type|String|The device type that the user used to create their account|'Mac Desktop', 'Windows Desktop', 'iPhone', 'Other/Unknown', 'Desktop (Other)', 'Android Tablet', 'iPad', 'Android Phone', or 'SmartPhone (Other)'
15|first_browser|String|Browser the user used to create their account|'Chrome', 'IE', 'Firefox', 'Safari', '-unknown-', 'Mobile Safari', 'Chrome Mobile', 'RockMelt', 'Chromium', 'Android Browser', 'AOL Explorer', 'Palm Pre web browser', 'Mobile Firefox', 'Opera', 'TenFourFox', 'IE Mobile', 'Apple Mail', 'Silk', 'Camino', 'Arora', 'BlackBerry Browser', 'SeaMonkey', 'Iron', 'Sogou Explorer', 'IceWeasel', 'Opera Mini', 'SiteKiosk', 'Maxthon', 'Kindle Browser', 'CoolNovo', 'Conkeror', 'wOSBrowser', 'Google Earth', 'Crazy Browser', 'Mozilla', 'OmniWeb', 'PS Vita browser', 'NetNewsWire', 'CometBird', 'Comodo Dragon', 'Flock', 'Pale Moon', 'Avant Browser', 'Opera Mobile', 'Yandex.Browser', 'TheWorld Browser', 'SlimBrowser', 'Epic', 'Stainless', 'Googlebot', 'Outlook 2007', or 'IceDragon' (52 unique values)
16|country_destination|String|Country of user's first booking|'US', 'FR', 'CA', 'GB', 'ES', 'IT', 'PT', 'NL', 'DE', 'AU', 'NDF' (No destination found), or 'other' (12 unique values)

#### test_users.csv (4 of 5)
Same as **train_users.csv**, except that the **country_destination** field is not present.

#### sessions.csv (5 of 5)

The sessions file contains a log of activities of selected visitors to the Airbnb site. Each record represents a single interaction—a page view, for example—by a user with the site. There are 10,567,737 records in the data file. 

Pos.|Variable|Type|Description|Range/Values
---:|---|:---:|---|---
1|user_id|String|Alphanumeric string representing a unique user|135,484 unique user IDs
2|action|String|The specific interaction a user had with the site.|There are 360 unique values, so they are not all listed here. The most common type is "show", which occurs over 2.7 million times. Other top values include "index", "search_results", "personalize", and "search".
3|action_type|String|A more general category of interaction. See "Data Quality" note 5.|nan, 'click', 'data', 'view', 'submit', 'message_post', '-unknown-', 'booking_request', 'partner_callback', 'booking_response', or 'modify' (11 unique values)
4|action_detail|String|A slightly more detailed description of the user's interaction than the "action" field.|There are 156 unique values, the most common of which are "view_search_results", "p3", and "-unknown-".
5|device_type|String|The type of device that the user used for this interaction.|'Windows Desktop', '-unknown-', 'Mac Desktop', 'Android Phone', 'iPhone', 'iPad Tablet', 'Android App Unknown Phone/Tablet', 'Linux Desktop', 'Tablet', 'Chromebook', 'Blackberry', 'iPodtouch', 'Windows Phone', or 'Opera Phone' (14 unique values)
6|secs_elapsed|Numeric (integer)|Seconds elapsed between this interaction and the one preceding it (for this user)|Min: 0<br>Max: 1,799,977<br>Mean: 19,405.8<br>Median: 1,147

### 2.2 Initial Data Exploration and Manipulation (Individual Data Files)
In this section, we begin to explore the data files **separately** in order to understand what's going on with each of the attributes, to identify potential problems, and to apply data manipulations (such as excluding observations or imputing data). While the steps in this section are commented, please refer to the "Data Quality", "Data Imputation", and "Data Consolidation" sections for more complete explanations.

In [1]:
# Load libraries
import pandas as pd
import numpy as np

#### age_gender_bkts.csv

In [2]:
# read in the csv file
df_agb = pd.read_csv('age_gender_bkts.csv')

In [3]:
# Show all null values
df_agb[df_agb.isnull().any(axis=1)]

Unnamed: 0,age_bucket,country_destination,gender,population_in_thousands,year


No null values!
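An empty result confirms there are no rows with nulls, but a per-column count is often easier to read on wider tables. A small sketch on a made-up frame (the values below are illustrative, not from the data file):

```python
import numpy as np
import pandas as pd

# Toy frame with two deliberately missing values
toy = pd.DataFrame({"age_bucket": ["0-4", "5-9", None],
                    "population_in_thousands": [1.0, np.nan, 3.0]})

null_counts = toy.isnull().sum()      # nulls per column
total_nulls = int(null_counts.sum())  # nulls in the whole frame
print(null_counts)
print("Total nulls:", total_nulls)
```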

In [5]:
# Take a peek at the data
print (df_agb.head(10))
print (df_agb.tail())

  age_bucket country_destination gender  population_in_thousands    year
0       100+                  AU   male                      1.0  2015.0
1      95-99                  AU   male                      9.0  2015.0
2      90-94                  AU   male                     47.0  2015.0
3      85-89                  AU   male                    118.0  2015.0
4      80-84                  AU   male                    199.0  2015.0
5      75-79                  AU   male                    298.0  2015.0
6      70-74                  AU   male                    415.0  2015.0
7      65-69                  AU   male                    574.0  2015.0
8      60-64                  AU   male                    636.0  2015.0
9      55-59                  AU   male                    714.0  2015.0
    age_bucket country_destination  gender  population_in_thousands    year
415      95-99                  US    male                    115.0  2015.0
416      90-94                  US    male   

In [6]:
df_agb.head()

Unnamed: 0,age_bucket,country_destination,gender,population_in_thousands,year
0,100+,AU,male,1.0,2015.0
1,95-99,AU,male,9.0,2015.0
2,90-94,AU,male,47.0,2015.0
3,85-89,AU,male,118.0,2015.0
4,80-84,AU,male,199.0,2015.0


In [7]:
# Another way I like is to use the sample() method, which
# lets me see several records at random. I can repeat this several times
# or call sample() with a larger number to get a "wider" sense of the dataset.
df_agb.sample(5)

Unnamed: 0,age_bucket,country_destination,gender,population_in_thousands,year
64,55-59,CA,female,1305.0,2015.0
346,90-94,PT,female,45.0,2015.0
235,0-4,GB,female,1888.0,2015.0
307,10-14,NL,male,517.0,2015.0
221,60-64,GB,female,1775.0,2015.0


In [8]:
# Let's examine the types. We want to know whether the types are appropriate for the data.
df_agb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 420 entries, 0 to 419
Data columns (total 5 columns):
age_bucket                 420 non-null object
country_destination        420 non-null object
gender                     420 non-null object
population_in_thousands    420 non-null float64
year                       420 non-null float64
dtypes: float64(2), object(3)
memory usage: 16.5+ KB


Looks like we might want to make some changes.

In [9]:
# The first three variables look like categoricals. Let's explore those further.
df_agb['age_bucket'].value_counts()

80-84    20
70-74    20
0-4      20
60-64    20
45-49    20
5-9      20
40-44    20
15-19    20
75-79    20
35-39    20
65-69    20
20-24    20
100+     20
85-89    20
25-29    20
10-14    20
95-99    20
30-34    20
50-54    20
55-59    20
90-94    20
Name: age_bucket, dtype: int64

In [10]:
# Nice and neat. Let's factorize those.
df_agb['age_bucket'] = df_agb['age_bucket'].astype('category')

In [11]:
# Now the age_bucket attribute is a categorical with these values (categories):
df_agb['age_bucket'].cat.categories

Index(['0-4', '10-14', '100+', '15-19', '20-24', '25-29', '30-34', '35-39',
       '40-44', '45-49', '5-9', '50-54', '55-59', '60-64', '65-69', '70-74',
       '75-79', '80-84', '85-89', '90-94', '95-99'],
      dtype='object')

In [12]:
# We can tell these categories have a definite order to them, so we need to fix that.
df_agb['age_bucket'] = df_agb['age_bucket'].cat.set_categories([
    '0-4', '5-9', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39',
    '40-44', '45-49',  '50-54', '55-59', '60-64', '65-69', '70-74',
    '75-79', '80-84', '85-89', '90-94', '95-99', '100+'], ordered=True)

In [13]:
df_agb['age_bucket'].cat.categories

Index(['0-4', '5-9', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39',
       '40-44', '45-49', '50-54', '55-59', '60-64', '65-69', '70-74', '75-79',
       '80-84', '85-89', '90-94', '95-99', '100+'],
      dtype='object')
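To see why the category order matters, here is a short sketch on a made-up series with only four of the buckets. With an ordered categorical, sorting and comparisons follow the declared order instead of alphabetical order.

```python
import pandas as pd

buckets = pd.Series(["10-14", "0-4", "100+", "5-9"])

# Plain strings sort alphabetically, which scrambles the age order
plain_sorted = buckets.sort_values().tolist()

# An ordered categorical sorts and compares by the declared order
ordered = pd.Series(pd.Categorical(buckets,
                                   categories=["0-4", "5-9", "10-14", "100+"],
                                   ordered=True))
sorted_values = ordered.sort_values().tolist()
younger_than_10 = (ordered < "10-14").tolist()

print(plain_sorted)
print(sorted_values)
print(younger_than_10)
```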

In [14]:
# Let's look at country_destination
df_agb['country_destination'].value_counts()

CA    42
FR    42
ES    42
AU    42
NL    42
GB    42
IT    42
DE    42
US    42
PT    42
Name: country_destination, dtype: int64

In [15]:
# (So far, this seems like a nice clean dataset.)
# These are nominals, so we don't need to worry about order this time.
df_agb['country_destination'] = df_agb['country_destination'].astype('category')

In [16]:
# Now let's look at gender.
df_agb['gender'].value_counts()

female    210
male      210
Name: gender, dtype: int64

In [17]:
# No surprises found, so let's convert to nominal
df_agb['gender'] = df_agb['gender'].astype('category')

In [18]:
# Now let's review
df_agb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 420 entries, 0 to 419
Data columns (total 5 columns):
age_bucket                 420 non-null category
country_destination        420 non-null category
gender                     420 non-null category
population_in_thousands    420 non-null float64
year                       420 non-null float64
dtypes: category(3), float64(2)
memory usage: 9.1 KB


In [19]:
df_agb.sample(5)

Unnamed: 0,age_bucket,country_destination,gender,population_in_thousands,year
118,15-19,DE,female,1974.0,2015.0
133,45-49,ES,male,1909.0,2015.0
272,5-9,IT,male,1473.0,2015.0
148,70-74,ES,male,880.0,2015.0
307,10-14,NL,male,517.0,2015.0


Should `population_in_thousands` really be an integer? Since the number represents population _in thousands_, it is reasonable that it be represented as a float. (If you were to examine the file, you'd find that none of the values has a fractional component, but there's no good reason not to keep it as a float.)
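Rather than examining the file by eye, the "no fractional component" claim can be checked directly. A minimal sketch on a made-up series; against the real data you would apply the same expression to `df_agb['population_in_thousands']`.

```python
import pandas as pd

pop = pd.Series([1.0, 9.0, 47.0, 118.0])    # made-up whole-number floats
with_fraction = pd.Series([1.0, 9.5, 47.0])  # made-up counterexample

# A float is a whole number when the remainder after dividing by 1 is zero
all_whole = bool((pop % 1 == 0).all())
all_whole_counterexample = bool((with_fraction % 1 == 0).all())
print(all_whole, all_whole_counterexample)
```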

In [20]:
df_agb['population_in_thousands'].describe()

count      420.000000
mean      1743.133333
std       2509.843202
min          0.000000
25%        396.500000
50%       1090.500000
75%       1968.000000
max      11601.000000
Name: population_in_thousands, dtype: float64

Could the presence of zeros indicate missing values?

In [21]:
# Minimum value of 0.0 is a red flag. What's going on here?
df_agb[df_agb['population_in_thousands'] == 0]

Unnamed: 0,age_bucket,country_destination,gender,population_in_thousands,year
328,100+,NL,male,0.0,2015.0
358,100+,PT,male,0.0,2015.0


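If we think about what this data table is telling us, we realize this is perfectly legitimate: populations are reported in thousands, so a group as small as men aged 100 and over in the Netherlands or Portugal can plausibly round down to 0.
We're going to leave `population_in_thousands` as is.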

Now let's look at `year`.

In [22]:
df_agb[df_agb['year'].isnull()]

Unnamed: 0,age_bucket,country_destination,gender,population_in_thousands,year


That's weird. There are no missing values, so why did `year` come in as floats rather than integers?

The answer is that the year attribute actually includes the decimal place in the data file!
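The effect is easy to reproduce with a couple of in-memory CSV snippets (the rows are made up for illustration): `read_csv` infers a float column when the text contains a decimal point, and an integer column when it does not.

```python
import io
import pandas as pd

# Year written with a decimal point, as described above
csv_with_decimal = "age_bucket,year\n100+,2015.0\n95-99,2015.0\n"
df_float = pd.read_csv(io.StringIO(csv_with_decimal))

# Same data with the year written as a plain integer
csv_without_decimal = "age_bucket,year\n100+,2015\n95-99,2015\n"
df_int = pd.read_csv(io.StringIO(csv_without_decimal))

print(df_float["year"].dtype, df_int["year"].dtype)
```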

In [23]:
# Ordinarily, we would convert year to integer...
df_agb['year'] = df_agb['year'].astype('int')

In [24]:
# ... but since every record has the same value for year,
# it's not really useful. We can simply remove it.
del df_agb['year']
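On wider tables it can help to find constant columns programmatically instead of spotting them by eye. A small sketch on a made-up frame; the same list comprehension should work on any DataFrame.

```python
import pandas as pd

toy = pd.DataFrame({"gender": ["male", "female", "male"],
                    "year": [2015, 2015, 2015]})

# A column with at most one distinct value carries no information
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) <= 1]
toy = toy.drop(columns=constant_cols)

print(constant_cols)
print(list(toy.columns))
```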

In [None]:
# One last look:
df_agb.sample(5)

---

That was a pretty clean dataset, so we didn't have much work to do.
Let's look at a slightly more difficult dataset now.


### train_users.csv

In [None]:
# read in the csv file
df_train_users = pd.read_csv('train_users_2.csv')

In [None]:
df_train_users.info()

In [None]:
# Let's take a peek at the data
df_train_users.sample(5)

The `id` attribute is just an arbitrary string, so we can leave that alone.

Looks like we have some date attributes. We can convert those to datetime types.

In [None]:
# convert select fields to datetime object
df_train_users['date_account_created'] = pd.to_datetime(df_train_users['date_account_created'])
df_train_users['timestamp_first_active'] = pd.to_datetime(df_train_users['timestamp_first_active'])
df_train_users['date_first_booking'] = pd.to_datetime(df_train_users['date_first_booking'])

In [None]:
df_train_users['date_account_created'].describe()

It doesn't look like there are any missing or extreme values.

In [None]:
# If we want, we can look at several attributes at once by
# referencing a list instead of a single attribute name.
df_train_users[['timestamp_first_active', 'date_first_booking']].describe()

Now we see something suspicious! January 1, 1970 (or sometimes December 31, 1969) is a meaningful date -- the beginning of the Unix epoch. The output above tells us that ALL the values are some time on Jan 1, 1970.
We know this is a bogus date, so we don't have much choice but to drop the attribute.
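Before dropping the attribute, one possibility worth ruling out is a parsing artifact rather than bad data. If the raw column stores each timestamp as an integer of the form YYYYMMDDHHMMSS (an assumption; check the CSV), `pd.to_datetime` would read those integers as nanoseconds since the Unix epoch, which lands every value a few hours into January 1, 1970. The sketch below uses two made-up values to show the effect, plus an explicit `format` that would recover the dates under that assumption.

```python
import pandas as pd

# Hypothetical raw values in YYYYMMDDHHMMSS form (not taken from the file)
raw = pd.Series([20090319043255, 20140630235824])

# Without a format, integers are interpreted as nanoseconds since 1970-01-01
naive = pd.to_datetime(raw)

# With an explicit format, the digits are read as a calendar date and time
parsed = pd.to_datetime(raw.astype(str), format="%Y%m%d%H%M%S")

print(naive)
print(parsed)
```

If the real column does turn out to follow this layout, parsing with the explicit format would be an alternative to deleting the attribute.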

In [None]:
del df_train_users['timestamp_first_active']

Did you notice that only 88,908 rows have a value in the `date_first_booking` attribute? Is this a problem?

No, it's not. The reason is that it is totally reasonable that some users have not yet booked a first trip using Airbnb. This is important information.
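One way to back up this reading is to check that a missing booking date coincides with the 'NDF' (no destination found) label. The sketch below uses a made-up frame; whether the two line up exactly in the real training data is something to verify, not something assumed here.

```python
import pandas as pd

toy = pd.DataFrame({
    "date_first_booking": pd.to_datetime(["2014-01-05", None, "2014-03-01", None]),
    "country_destination": ["US", "NDF", "FR", "NDF"],
})

no_booking = toy["date_first_booking"].isnull()
is_ndf = toy["country_destination"] == "NDF"

# True only if every missing date is NDF and every NDF has a missing date
consistent = bool((no_booking == is_ndf).all())
print(consistent)
```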

---

In [None]:
# Let's look at the numeric attributes.
# Turns out there's only one! (age)
df_train_users['age'].describe()

In [None]:
# set precision for displaying number in a nice format
pd.options.display.float_format = '{:.3f}'.format
np.set_printoptions(precision=3, suppress=True)

In [None]:
# Right away we see problems. 2014 is probably not a user's age!
np.sort(df_train_users['age'].unique())

Obviously, there are some bogus values here. Let's drill down.

In [None]:
# The Airbnb terms of service state that users must be at least 18 years old.
print ("Number of users less than 18 years old :",  sum(df_train_users.age < 18))

# We feel (quite arbitrarily) that ages greater than 95 are unrealistic.
print ("Number of users over 95 years old :",  sum(df_train_users.age > 95))

In [None]:
# Replace ages below 18 or above 95 with NaN
df_train_users.loc[df_train_users.age > 95, 'age'] = np.nan
df_train_users.loc[df_train_users.age < 18, 'age'] = np.nan
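The same masking can be written as a single expression with `Series.where` and `Series.between`. This is an equivalent sketch on made-up ages, shown as an alternative rather than a replacement for the cell above.

```python
import numpy as np
import pandas as pd

age = pd.Series([1, 17, 18, 35, 95, 96, 2014, np.nan])

# Keep ages in the closed range [18, 95]; everything else becomes NaN
cleaned = age.where(age.between(18, 95))
print(cleaned.tolist())
```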

In [None]:
# Show basic statistics for age (BEFORE IMPUTATION)
print("The mean age is %.2f" % df_train_users['age'].mean())
print("The median age is %.2f" % df_train_users['age'].median())
print("There are %d null age values." % len(df_train_users[df_train_users['age'].isnull()]))

There are 90,586 records with no "age" value. In addition, there are many observations with values that don't make sense: the minimum age is 1 and the maximum age is 2014. There seem to be some data quality issues with age, where the user has been able to enter an arbitrary value. We will impute using median values, grouped by `country_destination`,
`gender`, and `signup_app`.

In [None]:
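# Impute those with the median age, grouped by country_destination, gender, and signup_app.
print ("-" * 60)
print ("Performing imputation now")
# Assign the result back instead of calling fillna(inplace=True) on a column selection
group_median_age = df_train_users.groupby(by=['country_destination','gender','signup_app'])['age'].transform("median")
df_train_users['age'] = df_train_users['age'].fillna(group_median_age)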
print ("Imputation complete.")
print ("-" * 60)
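To see how group-wise median imputation behaves, here is a sketch on a made-up frame with two grouping columns. Note the last row: when a group has no valid ages at all, its median is NaN and the value stays missing.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "gender": ["MALE", "MALE", "MALE", "FEMALE", "FEMALE", "OTHER"],
    "signup_app": ["Web"] * 6,
    "age": [30, 40, np.nan, 25, np.nan, np.nan],
})

# transform("median") returns one value per row: the median of that row's group
group_median = toy.groupby(["gender", "signup_app"])["age"].transform("median")
toy["age"] = toy["age"].fillna(group_median)
print(toy)
```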

In [None]:
# Show basic statistics for age (AFTER IMPUTATION)
print("The mean age is %.2f" % df_train_users['age'].mean())
print("The median age is %.2f" % df_train_users['age'].median())
print("There are %d null age values." % len(df_train_users[df_train_users['age'].isnull()]))

In [None]:
# Show the leftovers
df_train_users[df_train_users['age'].isnull()]

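The imputation could not fill this row, which means its group (the combination of `country_destination`, `gender`, and `signup_app`) contains no valid ages from which to compute a median. Since the dataset is quite large, we can afford to simply remove this row.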

In [None]:
# Drop any rows that still have no age, rather than relying on a hard-coded row position
df_train_users.dropna(subset=['age'], inplace=True)

In [None]:
# Confirm:
df_train_users[df_train_users['age'].isnull()]

---

Now let's deal with the categorical variables.

In [None]:
# Let's make a list of the categorical attributes
user_cats = ['gender', 'signup_method', 'signup_flow', 'language', 'affiliate_channel',
            'affiliate_provider', 'first_affiliate_tracked', 'signup_app', 'first_device_type',
            'first_browser', 'country_destination']

In [None]:
# Let's examine the values and counts for each categorical variable.
for c in user_cats:
    print("----------------")
    print(c)
    print(df_train_users[c].value_counts())

Each user has exactly one "international language preference". At the time of writing, Airbnb permits users to have multiple languages (or no languages) associated with their profiles. We shall assume that the value in this field indicates a user's "primary" language.

Probably not much else to be done now. You may wish to convert the categoricals to the `category` type, as we did previously in the `age_gender_bkts` dataset.
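If you do convert them, several columns can be handled in one call by passing a dict to `astype`. A sketch on a made-up frame; `cat_cols` here is a shortened, illustrative stand-in for the full list of categorical attributes.

```python
import pandas as pd

toy = pd.DataFrame({"gender": ["MALE", "FEMALE", "-unknown-"],
                    "signup_flow": [0, 3, 25],
                    "age": [30.0, 41.0, 28.0]})

cat_cols = ["gender", "signup_flow"]  # illustrative subset of categorical columns

# Map each listed column to the category dtype; other columns are untouched
toy = toy.astype({c: "category" for c in cat_cols})
print(toy.dtypes)
```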

---

Let's do some feature engineering.

To start, it may be useful to categorize users based on age group. One approach would be to use the
groupings in the age_gender_bkts table. But let's consider a more coarse-grained approach instead,
which may lend itself to easier analysis. Let's try:

* 18 to 40 ("younger" folks)
* 41 to 60 ("middle age" folks)
* over 60  ("old" folks)

In [None]:
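# Break up the age variable and create a new age_range variable.
# right=False makes each bin include its left edge, so the bins are
# [18, 41), [41, 61), and [61, 1e6); age 18 falls in the first bin.
df_train_users['age_range'] = pd.cut(df_train_users['age'], # The series to cut up
                                     bins=[18,41,61,1e6],   # list of cut points
                                     right=False,           # include left edge, exclude right edge
                                     labels=['from_18_to_40','from_41_to_60','Over_60'])
# this creates a new variable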
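Bin edges are easy to get wrong, so it is worth checking how boundary ages are assigned. The sketch below compares the default behavior of `pd.cut` (bins closed on the right) with `right=False` (bins closed on the left) for the same cut points, using a few made-up ages.

```python
import pandas as pd

ages = pd.Series([18, 40, 41, 60, 61, 95])
bins = [18, 41, 61, 1e6]
labels = ["from_18_to_40", "from_41_to_60", "Over_60"]

# Default: intervals are (18, 41], (41, 61], (61, 1e6]
right_closed = pd.cut(ages, bins=bins, labels=labels)

# right=False: intervals are [18, 41), [41, 61), [61, 1e6)
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

comparison = pd.DataFrame({"age": ages,
                           "default": right_closed,
                           "right=False": left_closed})
print(comparison)
```

With the default, age 18 falls outside every bin and ages 41 and 61 land in the lower group; with `right=False` the assignments match the label names.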

In [None]:
df_train_users.sample(5)

---

### Notes

* There is an error in the header row of `countries.csv`. There is an errant space at the end
of the `destination_language` header that you'll need to remove by hand. The space may cause some
pandas methods to break.
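As an alternative to editing the file by hand, the stray whitespace can be stripped from the column names after loading. A sketch using an in-memory CSV with a made-up pair of rows and the trailing space reproduced in the header:

```python
import io
import pandas as pd

# Header mimics the problem described above: a space after destination_language
csv_text = "country_destination,destination_language \nUS,eng\nFR,fra\n"
countries = pd.read_csv(io.StringIO(csv_text))

# Remove leading and trailing whitespace from every column name
countries.columns = countries.columns.str.strip()
print(list(countries.columns))
```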

---
