---
<center><h1>Lesson 2 - Basic intro into pandas</h1></center>
---
---

<center><h2>Part 2. Work with pandas DataFrames: filtering, indexing and missing data</h2></center>
---

## Table of Contents

- [Work with pandas DataFrames: filtering, indexing and missing data](#Work-with-pandas-DataFrames:-filtering,-indexing-and-missing-data)
    * [Get basic information](#Get-basic-information)
    * [Conditional indexing and selection](#Conditional-indexing-and-selection)
    * [Work with indexes and MultiIndex option](#Work-with-indexes-and-MultiIndex-option)
    * [Selection by label and position](#Selection-by-label-and-position)
    * [Work with missing data](#Work-with-missing-data)
    - [*Exercise 2.1*](#Exercise-2.1)

In [1]:
import pandas as pd
import numpy as np
import random

## Work with pandas DataFrames: filtering, indexing and missing data

[[back to top]](#Table-of-Contents)

In this part we will continue getting acquainted with DataFrames and will learn
1.	how to get basic information about a DataFrame and its content;
2.	how to get a segment of a DataFrame and select the rows that satisfy some conditions;
3.	how to change the index of a DataFrame and use advanced indexing;
4.	how to select rows by their index labels and positions;
5.	how to work with missing data.

The text of this post is therefore divided into blocks that follow the points listed above. In the following posts we will continue learning pandas and consider its other features.

For our future work let’s choose the [MovieLens dataset](http://grouplens.org/datasets/movielens/), which contains 100K ratings collected from the MovieLens web site (http://movielens.org). It consists of three parts. To simplify our work, we have merged them beforehand and will use this prepared dataset for the rest of this lesson.  

In [8]:
movielense = pd.read_csv('data/u.data', sep='\t',engine='python', names=['user_id', 'movie_id', 'rating', 'timestamp'])
movieuser = pd.read_csv('data/u.user', sep='|', engine='python', names=['user_id', 'age', 'gender', 'occupation', 'zip_code'])
movieitem = pd.read_csv('data/u.item', sep='|', engine='python', 
                        names=['movie_id', 'movie_title', 'release_date', 'video_release_date', 'IMDb_URL', 
                               'unknown', 'Action', 'Adventure', 'Animation', 'Childrens', 'Comedy', 'Crime', 
                               'Documentary' , 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 
                               'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western'])

movies = pd.read_csv('data/movies.csv', encoding="ISO-8859-1")
movies['release_date'] = movies['release_date'].map(pd.to_datetime)

movies.head(10)

Unnamed: 0,user_id,movie_id,rating,timestamp,age,gender,occupation,zip_code,movie_title,release_date,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,196,242,3,881250949,49.0,M,writer,55105,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
1,305,242,5,886307828,23.0,M,programmer,94086,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
2,6,242,4,883268170,42.0,M,executive,98101,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
3,234,242,4,891033261,60.0,M,retired,94702,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
4,63,242,3,875747190,31.0,M,marketing,75240,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
5,181,242,1,878961814,26.0,M,executive,21218,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
6,201,242,4,884110598,27.0,M,writer,E2A4H,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
7,249,242,5,879571438,25.0,M,student,84103,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
8,13,242,2,881515193,47.0,M,educator,29206,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
9,279,242,3,877756647,33.0,M,programmer,85251,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
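
The merge that produced `movies.csv` is not shown in this lesson. As a rough sketch only, assuming the three tables are joined on the `user_id` and `movie_id` keys, it might look like the following; the tiny frames are made-up stand-ins for `u.data`, `u.user` and `u.item`.

```python
import pandas as pd

# Toy stand-ins for the three MovieLens tables (values are illustrative only)
ratings = pd.DataFrame({'user_id': [196, 305], 'movie_id': [242, 242], 'rating': [3, 5]})
users = pd.DataFrame({'user_id': [196, 305], 'occupation': ['writer', 'programmer']})
items = pd.DataFrame({'movie_id': [242], 'movie_title': ['Kolya (1996)']})

# Attach user attributes, then movie attributes, to every rating row
merged = ratings.merge(users, on='user_id', how='left').merge(items, on='movie_id', how='left')
print(merged)
```

A left join keeps every rating even when the user or movie record is missing, which would explain missing values in columns such as `age` or `occupation`.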


### Get basic information

[[back to top]](#Table-of-Contents)

pandas has a set of functions for getting basic information about a DataFrame.

Let’s see the type of `movies`

In [9]:
type(movies)

pandas.core.frame.DataFrame

Yes, DataFrame is a pandas class of its own rather than one of Python’s built-in data structures!
Now let’s take a look at the types of the `movies` columns

In [10]:
movies.dtypes

user_id                  int64
movie_id                 int64
rating                   int64
timestamp                int64
age                    float64
gender                  object
occupation              object
zip_code                object
movie_title             object
release_date    datetime64[ns]
IMDb_URL                object
unknown                  int64
Action                   int64
Adventure                int64
Animation                int64
Childrens                int64
Comedy                   int64
Crime                    int64
Documentary              int64
Drama                    int64
Fantasy                  int64
Film-Noir                int64
Horror                   int64
Musical                  int64
Mystery                  int64
Romance                  int64
Sci-Fi                   int64
Thriller                 int64
War                      int64
Western                  int64
dtype: object

You can also get a concise summary of the DataFrame (row count, column types, non-null counts) with `info()`

In [11]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 30 columns):
user_id         100000 non-null int64
movie_id        100000 non-null int64
rating          100000 non-null int64
timestamp       100000 non-null int64
age             93731 non-null float64
gender          100000 non-null object
occupation      93806 non-null object
zip_code        100000 non-null object
movie_title     100000 non-null object
release_date    99991 non-null datetime64[ns]
IMDb_URL        99987 non-null object
unknown         100000 non-null int64
Action          100000 non-null int64
Adventure       100000 non-null int64
Animation       100000 non-null int64
Childrens       100000 non-null int64
Comedy          100000 non-null int64
Crime           100000 non-null int64
Documentary     100000 non-null int64
Drama           100000 non-null int64
Fantasy         100000 non-null int64
Film-Noir       100000 non-null int64
Horror          100000 non-null int64
Musi

Method `info()` shows (top down)
+ that `movies` is an instance of the DataFrame class; we obtained this information earlier with the function `type()`;
+ the number of rows in the DataFrame;
+ the type of each column and the number of non-null values in it; `dtypes` gave the types in a shorter form;
+ the memory size of the DataFrame etc.

Method `describe()` quickly gives the count, mean, standard deviation, minimum, maximum and quartiles of each numeric column

In [12]:
movies.describe()

Unnamed: 0,user_id,movie_id,rating,timestamp,age,unknown,Action,Adventure,Animation,Childrens,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
count,100000.0,100000.0,100000.0,100000.0,93731.0,100000.0,100000.0,100000.0,100000.0,100000.0,...,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0
mean,462.48475,425.53013,3.52986,883528900.0,32.9665,0.0001,0.25589,0.13753,0.03605,0.07182,...,0.01352,0.01733,0.05317,0.04954,0.05245,0.19461,0.1273,0.21872,0.09398,0.01854
std,266.61442,330.798356,1.125674,5343856.0,11.561809,0.01,0.436362,0.344408,0.186416,0.258191,...,0.115487,0.130498,0.224373,0.216994,0.222934,0.395902,0.33331,0.41338,0.291802,0.134894
min,1.0,1.0,1.0,874724700.0,7.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,254.0,175.0,3.0,879448700.0,24.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,447.0,322.0,4.0,882826900.0,30.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,682.0,631.0,4.0,888260000.0,40.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,943.0,1682.0,5.0,893286600.0,73.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
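
As a small illustration on a made-up frame (not the MovieLens data), `describe()` skips non-numeric columns by default, while `include='all'` asks for a summary of every column:

```python
import pandas as pd

toy = pd.DataFrame({'age': [23, 42, 60], 'gender': ['M', 'F', 'M']})

numeric_summary = toy.describe()             # only the numeric column 'age'
full_summary = toy.describe(include='all')   # also summarizes 'gender'

print(numeric_summary.columns.tolist())
print(full_summary.columns.tolist())
```

This is why columns such as `gender` or `movie_title` do not appear in the table above.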


### Conditional indexing and selection

[[back to top]](#Table-of-Contents)

As we said above, a DataFrame is a collection of Series objects. This allows you to select a single column from the DataFrame (in this case you get a Series) or several columns (in this case you get another DataFrame)

In [13]:
movies_rating = movies['rating']
# Here we are showing only one column, i.e. a Series
print ('type:', type(movies_rating))
movies_rating.head()

type: <class 'pandas.core.series.Series'>


0    3
1    5
2    4
3    4
4    3
Name: rating, dtype: int64

In [14]:
movies_user = movies[['age', 'gender', 'occupation']]
# Here we are showing three columns, i.e. a new DataFrame
print ('type:', type(movies_user))
movies_user.tail()

type: <class 'pandas.core.frame.DataFrame'>


Unnamed: 0,age,gender,occupation
99995,17.0,M,student
99996,,M,student
99997,17.0,M,student
99998,28.0,M,writer
99999,27.0,M,engineer


You can also refer to a single column as an attribute of the DataFrame

In [15]:
movies.user_id

0        196
1        305
2          6
3        234
4         63
5        181
6        201
7        249
8         13
9        279
10       145
11        90
12       271
13        18
14         1
15       207
16        14
17       113
18       123
19       296
20       154
21       270
22       240
23       144
24        21
25       239
26       111
27       129
28       131
29       226
        ... 
99970    894
99971    747
99972    747
99973    751
99974    762
99975    782
99976    782
99977    782
99978    782
99979    782
99980    782
99981    839
99982    870
99983    880
99984    782
99985    782
99986    782
99987    787
99988    828
99989    896
99990    835
99991    840
99992    851
99993    851
99994    854
99995    863
99996    863
99997    863
99998    896
99999    916
Name: user_id, Length: 100000, dtype: int64
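
Attribute access is a convenience only. A quick sketch on a toy frame shows where it does not apply: a column whose name is not a valid Python identifier (such as `Sci-Fi` in our dataset) has to be selected with brackets.

```python
import pandas as pd

toy = pd.DataFrame({'rating': [3, 5], 'Sci-Fi': [0, 1]})

# Both forms return the same Series for a simple column name
same = toy.rating.equals(toy['rating'])

# 'Sci-Fi' contains a hyphen, so toy.Sci-Fi would be parsed as (toy.Sci) - Fi;
# bracket selection is the only option here
sci_fi = toy['Sci-Fi']
print(same, sci_fi.tolist())
```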

Filtered DataFrames can be obtained by using comparison and logical operators

In [16]:
# Let's display only men
movies_user_gender_male = movies[movies['gender'] == 'M']
movies_user_gender_male.head(10)

Unnamed: 0,user_id,movie_id,rating,timestamp,age,gender,occupation,zip_code,movie_title,release_date,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,196,242,3,881250949,49.0,M,writer,55105,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
1,305,242,5,886307828,23.0,M,programmer,94086,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
2,6,242,4,883268170,42.0,M,executive,98101,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
3,234,242,4,891033261,60.0,M,retired,94702,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
4,63,242,3,875747190,31.0,M,marketing,75240,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
5,181,242,1,878961814,26.0,M,executive,21218,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
6,201,242,4,884110598,27.0,M,writer,E2A4H,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
7,249,242,5,879571438,25.0,M,student,84103,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
8,13,242,2,881515193,47.0,M,educator,29206,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
9,279,242,3,877756647,33.0,M,programmer,85251,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0


In [17]:
# Get records with `age` greater than 40 and `occupation` equal to 'writer', 'student' or 'programmer'
job_range = ['writer', 'student', 'programmer']
filtered_df_1 = movies[(movies['age'] > 40 ) & (movies['occupation'].isin(job_range))]
filtered_df_1.head(10)

Unnamed: 0,user_id,movie_id,rating,timestamp,age,gender,occupation,zip_code,movie_title,release_date,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,196,242,3,881250949,49.0,M,writer,55105,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
117,196,393,4,881251863,49.0,M,writer,55105,Mrs. Doubtfire (1993),1993-01-01,...,0,0,0,0,0,0,0,0,0,0
173,144,393,4,888105743,53.0,M,programmer,20910,Mrs. Doubtfire (1993),1993-01-01,...,0,0,0,0,0,0,0,0,0,0
197,389,393,2,880088717,44.0,F,writer,83702,Mrs. Doubtfire (1993),1993-01-01,...,0,0,0,0,0,0,0,0,0,0
226,506,393,3,874874802,46.0,M,programmer,3869,Mrs. Doubtfire (1993),1993-01-01,...,0,0,0,0,0,0,0,0,0,0
263,694,393,3,875728952,60.0,M,programmer,6365,Mrs. Doubtfire (1993),1993-01-01,...,0,0,0,0,0,0,0,0,0,0
309,196,381,4,881251728,49.0,M,writer,55105,Muriel's Wedding (1994),1994-01-01,...,0,0,0,0,0,1,0,0,0,0
348,379,381,5,885063301,44.0,M,programmer,98117,Muriel's Wedding (1994),1994-01-01,...,0,0,0,0,0,1,0,0,0,0
364,503,381,5,880383174,50.0,F,writer,27514,Muriel's Wedding (1994),1994-01-01,...,0,0,0,0,0,1,0,0,0,0
409,196,251,3,881251274,49.0,M,writer,55105,Shall We Dance? (1996),1997-07-11,...,0,0,0,0,0,0,0,0,0,0


In [18]:
# Records with `age` between 30 and 50, or a missing release_date, or rating 5
filtered_df_2 = movies[((movies['age'] > 30) & (movies['age'] < 50)) | (movies['release_date'].isnull()) \
                       | (movies['rating'] == 5)][['user_id','movie_id','rating','age','gender','occupation','movie_title']]
filtered_df_2.head(10)

Unnamed: 0,user_id,movie_id,rating,age,gender,occupation,movie_title
0,196,242,3,49.0,M,writer,Kolya (1996)
1,305,242,5,23.0,M,programmer,Kolya (1996)
2,6,242,4,42.0,M,executive,Kolya (1996)
4,63,242,3,31.0,M,marketing,Kolya (1996)
7,249,242,5,25.0,M,student,Kolya (1996)
8,13,242,2,47.0,M,educator,Kolya (1996)
9,279,242,3,33.0,M,programmer,Kolya (1996)
10,145,242,5,31.0,M,entertainment,Kolya (1996)
13,18,242,5,35.0,F,other,Kolya (1996)
14,1,242,5,,M,,Kolya (1996)


In the previous examples we used the method `isin()` to check whether Series items are present in a given list, the method `isnull()` to detect missing (`NaN`) values, and the boolean operators `&` (`AND`) and `|` (`OR`) to build compound conditions.
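
A minimal sketch on a toy frame of how such conditions combine. Each comparison yields a boolean Series; the parentheses are required because `&` and `|` bind more tightly than `>` or `==`, and `~` negates a mask.

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({'age': [25, 45, np.nan, 35],
                    'rating': [5, 3, 5, 4]})

both = toy[(toy['age'] > 30) & (toy['rating'] >= 4)]     # AND: every condition holds
either = toy[(toy['age'] > 30) | (toy['rating'] == 5)]   # OR: at least one holds
known_age = toy[~toy['age'].isnull()]                    # NOT: drop rows with missing age

print(both.index.tolist(), either.index.tolist(), known_age.index.tolist())
```

Note that a comparison with `NaN` is always `False`, so the row with a missing age can only pass through the `rating` condition.
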
As you can see, after filtering the resulting DataFrames keep the original row labels, so their indexes have gaps. To fix this you may write the following:

In [19]:
filtered_df_2.reset_index().head(10)

Unnamed: 0,index,user_id,movie_id,rating,age,gender,occupation,movie_title
0,0,196,242,3,49.0,M,writer,Kolya (1996)
1,1,305,242,5,23.0,M,programmer,Kolya (1996)
2,2,6,242,4,42.0,M,executive,Kolya (1996)
3,4,63,242,3,31.0,M,marketing,Kolya (1996)
4,7,249,242,5,25.0,M,student,Kolya (1996)
5,8,13,242,2,47.0,M,educator,Kolya (1996)
6,9,279,242,3,33.0,M,programmer,Kolya (1996)
7,10,145,242,5,31.0,M,entertainment,Kolya (1996)
8,13,18,242,5,35.0,F,other,Kolya (1996)
9,14,1,242,5,,M,,Kolya (1996)


This restarts the index from 0 and makes it consecutive again; the old labels are kept in a new `index` column.
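
If the old labels are not needed, `reset_index` accepts `drop=True`. A small sketch on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({'rating': [3, 1, 5, 4]})
filtered = toy[toy['rating'] > 2]              # keeps the original labels 0, 2, 3

kept = filtered.reset_index()                  # old labels move to an 'index' column
dropped = filtered.reset_index(drop=True)      # old labels are discarded

print(kept.columns.tolist(), dropped.index.tolist())
```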

Recall that you can add new columns and rows to a DataFrame:

In [20]:
#set new custom_score column and fill it with 0
filtered_df_1['custom_score'] = 0
filtered_df_1.head(10)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


Unnamed: 0,user_id,movie_id,rating,timestamp,age,gender,occupation,zip_code,movie_title,release_date,...,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,custom_score
0,196,242,3,881250949,49.0,M,writer,55105,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
117,196,393,4,881251863,49.0,M,writer,55105,Mrs. Doubtfire (1993),1993-01-01,...,0,0,0,0,0,0,0,0,0,0
173,144,393,4,888105743,53.0,M,programmer,20910,Mrs. Doubtfire (1993),1993-01-01,...,0,0,0,0,0,0,0,0,0,0
197,389,393,2,880088717,44.0,F,writer,83702,Mrs. Doubtfire (1993),1993-01-01,...,0,0,0,0,0,0,0,0,0,0
226,506,393,3,874874802,46.0,M,programmer,3869,Mrs. Doubtfire (1993),1993-01-01,...,0,0,0,0,0,0,0,0,0,0
263,694,393,3,875728952,60.0,M,programmer,6365,Mrs. Doubtfire (1993),1993-01-01,...,0,0,0,0,0,0,0,0,0,0
309,196,381,4,881251728,49.0,M,writer,55105,Muriel's Wedding (1994),1994-01-01,...,0,0,0,0,1,0,0,0,0,0
348,379,381,5,885063301,44.0,M,programmer,98117,Muriel's Wedding (1994),1994-01-01,...,0,0,0,0,1,0,0,0,0,0
364,503,381,5,880383174,50.0,F,writer,27514,Muriel's Wedding (1994),1994-01-01,...,0,0,0,0,1,0,0,0,0,0
409,196,251,3,881251274,49.0,M,writer,55105,Shall We Dance? (1996),1997-07-11,...,0,0,0,0,0,0,0,0,0,0
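
The warning printed above appears because `filtered_df_1` was produced by filtering `movies`, so pandas cannot tell whether the assignment is meant for the original frame or for a copy. One common way to avoid it, sketched here on a toy frame, is to take an explicit copy before adding the column:

```python
import pandas as pd

toy = pd.DataFrame({'age': [49, 23, 53],
                    'occupation': ['writer', 'programmer', 'programmer']})

# .copy() makes the filtered frame independent of `toy`
older = toy[toy['age'] > 40].copy()
older['custom_score'] = 0

print(older)
print('custom_score' in toy.columns)   # the original frame is untouched
```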


### Work with indexes and MultiIndex option

[[back to top]](#Table-of-Contents)

pandas allows you to set a specific index for a DataFrame. It can be defined when the DataFrame is created:

In [22]:
import random
indexes = [random.randrange(0, 100) for _ in range(5)]
data = [{col: random.randint(0, 10) for col in 'ABCDE'} for _ in range(5)]
df = pd.DataFrame(data, index=indexes)
df

Unnamed: 0,A,B,C,D,E
87,4,6,6,8,10
12,1,8,1,1,4
78,9,10,10,8,9
53,5,4,7,4,6
68,9,2,6,9,2


Or it can be changed at any time

In [23]:
df.index = ['a', 'b', 'c', 'd', 'e']
df

Unnamed: 0,A,B,C,D,E
a,4,6,6,8,10
b,1,8,1,1,4
c,9,10,10,8,9
d,5,4,7,4,6
e,9,2,6,9,2


It is also possible to use any column (one or more) as the index

In [25]:
# drop 'timestamp' duplicates to get unique values
movies_user_gender_male = movies_user_gender_male.drop_duplicates(subset='timestamp', keep='last')
# set 'timestamp' as index
movies_user_gender_male = movies_user_gender_male.set_index('timestamp')
movies_user_gender_male.head(10)

Unnamed: 0_level_0,user_id,movie_id,rating,age,gender,occupation,zip_code,movie_title,release_date,IMDb_URL,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
875747190,63,242,3,31.0,M,marketing,75240,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0
877756647,279,242,3,33.0,M,programmer,85251,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0
885844495,271,242,4,51.0,M,engineer,22932,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0
889751633,1,242,5,,M,,85711,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0
890793823,207,242,4,39.0,M,marketing,92037,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0
876964570,14,242,4,45.0,M,scientist,55106,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0
879141989,195,242,4,42.0,M,scientist,93555,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0
889041330,40,242,4,38.0,M,scientist,27514,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0
880353616,360,242,4,51.0,M,other,98027,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0
891546594,440,242,5,30.0,M,other,48076,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0


By default, `set_index()` returns a new DataFrame, so you have to assign the result back (as above) or pass `inplace=True` if you want the change applied to the original.
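
A small sketch of this behaviour on a toy frame: the call leaves the original untouched until the result is assigned back.

```python
import pandas as pd

toy = pd.DataFrame({'timestamp': [875747190, 877756647], 'rating': [3, 5]})

indexed = toy.set_index('timestamp')   # returns a new DataFrame
print(toy.index.tolist())              # original still has the default index
print(indexed.index.name)              # the new frame is indexed by 'timestamp'

toy = toy.set_index('timestamp')       # reassign to keep the change
```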

Let’s create a multi-level index for the `filtered_df_2` DataFrame

In [26]:
# set 'user_id' 'movie_id' as index
filtered_df_2_multi = filtered_df_2.set_index(['user_id','movie_id'])
filtered_df_2_multi.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,rating,age,gender,occupation,movie_title
user_id,movie_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
196,242,3,49.0,M,writer,Kolya (1996)
305,242,5,23.0,M,programmer,Kolya (1996)
6,242,4,42.0,M,executive,Kolya (1996)
63,242,3,31.0,M,marketing,Kolya (1996)
249,242,5,25.0,M,student,Kolya (1996)
13,242,2,47.0,M,educator,Kolya (1996)
279,242,3,33.0,M,programmer,Kolya (1996)
145,242,5,31.0,M,entertainment,Kolya (1996)
18,242,5,35.0,F,other,Kolya (1996)
1,242,5,,M,,Kolya (1996)


and see the type of `filtered_df_2_multi.index`

In [27]:
print ('type: ', type(filtered_df_2_multi.index))

type:  <class 'pandas.core.indexes.multi.MultiIndex'>


Thus, we get a new pandas class, MultiIndex, which holds the index information of the DataFrame and allows us to manipulate it. As an exercise, what is the type of `filtered_df_2.index`?

You can get the levels, labels and names of the index simply by accessing them as attributes (in pandas 0.24 and later the `labels` attribute is called `codes`)

In [28]:
filtered_df_2_multi.index.names 

FrozenList(['user_id', 'movie_id'])

Method `get_level_values()` returns all values of the corresponding index level

In [29]:
filtered_df_2_multi.index.get_level_values(0)

Int64Index([196, 305,   6,  63, 249,  13, 279, 145,  18,   1,
            ...
            699, 901, 675, 713, 883, 733, 762, 839, 835, 840],
           dtype='int64', name='user_id', length=48206)

and 

In [30]:
filtered_df_2_multi.index.get_level_values(1)   
# or filtered_df_2_multi.index.get_level_values('movie_id')

Int64Index([ 242,  242,  242,  242,  242,  242,  242,  242,  242,  242,
            ...
            1643, 1643, 1653, 1656, 1656, 1658, 1662, 1664, 1673, 1674],
           dtype='int64', name='movie_id', length=48206)
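
To make these attributes concrete, here is a sketch on a toy two-level index. Note that `levels` holds the unique values of each level, while `get_level_values` returns one value per row.

```python
import pandas as pd

toy = pd.DataFrame({'user_id': [196, 305, 196],
                    'movie_id': [242, 242, 393],
                    'rating': [3, 5, 4]}).set_index(['user_id', 'movie_id'])

names = list(toy.index.names)                                # level names
unique_users = toy.index.levels[0].tolist()                  # unique values of level 0
users_per_row = toy.index.get_level_values('user_id').tolist()  # one value per row

print(names, unique_users, users_per_row)
```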

We will meet the pandas MultiIndex again in the following posts, in particular when grouping DataFrames by column values.

### Selection by label and position
[[back to top]](#Table-of-Contents)

After reading the previous three subsections you probably have a question: OK, I now know how to filter a DataFrame and how to make it multi-indexed, but how do I select a specific row in the table, and moreover how do I select a row when using a MultiIndex?
Object selection in pandas is supported by three types of multi-axis indexing.

* `.loc` works on labels in the index;
* `.iloc` works on the positions in the index (so it only takes integers);
* `.ix` supports mixed integer and label based access; it usually tries to behave like `.loc` but falls back to behaving like `.iloc` if the label is not in the index. Note that `.ix` is deprecated (see the warning further below) and was removed in pandas 1.0.
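
The difference between label and position is easiest to see when the index is not the default 0, 1, 2, .... A minimal sketch on a toy frame (only `.loc` and `.iloc` are shown, since `.ix` is deprecated):

```python
import pandas as pd

toy = pd.DataFrame({'rating': [3, 5, 4]}, index=[10, 20, 30])

by_label = toy.loc[20, 'rating']     # row whose label is 20
by_position = toy.iloc[0, 0]         # first row, first column

print(by_label, by_position)
```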
    
The following examples demonstrate how we can work with a DataFrame’s rows.
First, let’s get the first row of `movies`

In [31]:
movies.loc[0]

user_id                                                     196
movie_id                                                    242
rating                                                        3
timestamp                                             881250949
age                                                          49
gender                                                        M
occupation                                               writer
zip_code                                                  55105
movie_title                                        Kolya (1996)
release_date                                1997-01-24 00:00:00
IMDb_URL        http://us.imdb.com/M/title-exact?Kolya%20(1996)
unknown                                                       0
Action                                                        0
Adventure                                                     0
Animation                                                     0
Childrens                               

and rows from 1 to 3 (pay attention to how ranges work in `.loc`: the right boundary is included in the range, unlike slices of Python lists and strings)

In [32]:
movies.loc[1:3]

Unnamed: 0,user_id,movie_id,rating,timestamp,age,gender,occupation,zip_code,movie_title,release_date,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
1,305,242,5,886307828,23.0,M,programmer,94086,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
2,6,242,4,883268170,42.0,M,executive,98101,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
3,234,242,4,891033261,60.0,M,retired,94702,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
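
A self-contained sketch of this boundary rule on a toy frame: a `.loc` slice includes its right boundary, while an `.iloc` slice follows the usual Python convention and excludes it.

```python
import pandas as pd

toy = pd.DataFrame({'rating': [3, 5, 4, 4, 3]})   # default index 0..4

label_slice = toy.loc[1:3]       # labels 1, 2 and 3
position_slice = toy.iloc[1:3]   # positions 1 and 2 only

print(label_slice.index.tolist(), position_slice.index.tolist())
```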


As you can see, the first argument of `.loc` is the index label. If you want to return the value of specific column(s), you should also pass the name(s) of the column(s)

In [33]:
movies.loc[0, 'movie_title']

'Kolya (1996)'

In [34]:
movies.loc[:, ['movie_title', 'rating']].head()

Unnamed: 0,movie_title,rating
0,Kolya (1996),3
1,Kolya (1996),5
2,Kolya (1996),4
3,Kolya (1996),4
4,Kolya (1996),3


Let’s repeat that the first argument of `.loc` is not the row number but the index label of the row

In [35]:
movies_user_gender_male.index

Int64Index([875747190, 877756647, 885844495, 889751633, 890793823, 876964570,
            879141989, 889041330, 880353616, 891546594,
            ...
            892958799, 891037722, 887159554, 891211682, 875731674, 884222085,
            889289491, 889289570, 887160722, 880845755],
           dtype='int64', name='timestamp', length=36289)

In [36]:
movies_user_gender_male.loc[875747190:891546594]

Unnamed: 0_level_0,user_id,movie_id,rating,age,gender,occupation,zip_code,movie_title,release_date,IMDb_URL,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
875747190,63,242,3,31.0,M,marketing,75240,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0
877756647,279,242,3,33.0,M,programmer,85251,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0
885844495,271,242,4,51.0,M,engineer,22932,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0
889751633,1,242,5,,M,,85711,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0
890793823,207,242,4,39.0,M,marketing,92037,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0
876964570,14,242,4,45.0,M,scientist,55106,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0
879141989,195,242,4,42.0,M,scientist,93555,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0
889041330,40,242,4,38.0,M,scientist,27514,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0
880353616,360,242,4,51.0,M,other,98027,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0
891546594,440,242,5,30.0,M,other,48076,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0


But if you need to obtain rows by their position, you may use `.iloc`

In [37]:
movies.iloc[0]

user_id                                                     196
movie_id                                                    242
rating                                                        3
timestamp                                             881250949
age                                                          49
gender                                                        M
occupation                                               writer
zip_code                                                  55105
movie_title                                        Kolya (1996)
release_date                                1997-01-24 00:00:00
IMDb_URL        http://us.imdb.com/M/title-exact?Kolya%20(1996)
unknown                                                       0
Action                                                        0
Adventure                                                     0
Animation                                                     0
Childrens                               

In [38]:
movies_user_gender_male.iloc[1:5]

Unnamed: 0_level_0,user_id,movie_id,rating,age,gender,occupation,zip_code,movie_title,release_date,IMDb_URL,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
877756647,279,242,3,33.0,M,programmer,85251,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0
885844495,271,242,4,51.0,M,engineer,22932,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0
889751633,1,242,5,,M,,85711,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0
890793823,207,242,4,39.0,M,marketing,92037,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0


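In the first case the row’s position coincides with its label, so `.loc[0]` and `.iloc[0]` return the same row. The second example demonstrates the difference between `.loc` and `.iloc`: the rows at positions 1 to 4 are returned regardless of their timestamp labels.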

`.ix` works like a combination of `.loc` and `.iloc` 

In [39]:
movies_user_gender_male.ix[886307828]

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  """Entry point for launching an IPython kernel.


user_id                                                       305
movie_id                                                      690
rating                                                          4
age                                                            23
gender                                                          M
occupation                                             programmer
zip_code                                                    94086
movie_title                           Seven Years in Tibet (1997)
release_date                                  1997-01-01 00:00:00
IMDb_URL        http://us.imdb.com/M/title-exact?Seven+Years+i...
unknown                                                         0
Action                                                          0
Adventure                                                       0
Animation                                                       0
Childrens                                                       0
Comedy    

So, in principle, `.ix` accepts both a position and an index label.

But let’s note some subtleties that can make `.ix` tricky to use:
* if the index is of integer type, `.ix` only uses label-based indexing and does not fall back to position-based indexing;
* if the index does not contain only integers, then, given an integer, `.ix` immediately uses position-based indexing rather than label-based indexing.
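
Because of these subtleties `.ix` was deprecated, as the warning above shows. A sketch of how mixed access (row by label, column by position) might be written with `.loc` and `.iloc` only, on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({'rating': [3, 5], 'age': [49, 23]},
                   index=[881250949, 886307828])

# Row by label, column by position: translate the position to a column name
value_a = toy.loc[886307828, toy.columns[1]]

# Same cell, fully positional: translate the row label to its position
row_pos = toy.index.get_loc(886307828)
value_b = toy.iloc[row_pos, 1]

print(value_a, value_b)
```

Being explicit about which axis uses labels and which uses positions removes the guessing that `.ix` had to do.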

But how can we extract data from a DataFrame with a MultiIndex? It’s very simple: pass a value for each index level as arguments of `.loc` (or `.ix`), like

In [40]:
filtered_df_2_multi.loc[6, 242] 

rating                    4
age                      42
gender                    M
occupation        executive
movie_title    Kolya (1996)
Name: (6, 242), dtype: object
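
A sketch on a toy frame of some common ways to address rows of a MultiIndex by label: a full key passed as a tuple, a partial key that selects everything under the first level, and `xs` for a key on an inner level.

```python
import pandas as pd

toy = pd.DataFrame({'user_id': [6, 6, 13],
                    'movie_id': [242, 393, 242],
                    'rating': [4, 2, 5]}).set_index(['user_id', 'movie_id'])

one_row = toy.loc[(6, 242), :]                   # full key -> a Series for that row
user_rows = toy.loc[6]                           # partial key -> all movies of user 6
movie_rows = toy.xs(242, level='movie_id')       # every user who rated movie 242

print(one_row['rating'], user_rows.index.tolist(), movie_rows.index.tolist())
```

Writing the key as a tuple makes it explicit that both values refer to index levels rather than to a row and a column.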

Try to obtain this result using `.iloc`.

Another way to extract slices from an object is the `select` method of Series and DataFrame. This method should be used only when there is no more direct way. For instance, suppose you need to select only the rows with an odd index label (timestamp) in `movies_user_gender_male` (we assume that you are familiar with lambda expressions in Python)

In [41]:
movies_user_gender_male.select(lambda x: x % 2).head(10)

  """Entry point for launching an IPython kernel.


Unnamed: 0_level_0,user_id,movie_id,rating,age,gender,occupation,zip_code,movie_title,release_date,IMDb_URL,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
877756647,279,242,3,33.0,M,programmer,85251,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0
885844495,271,242,4,51.0,M,engineer,22932,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0
889751633,1,242,5,,M,,85711,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0
890793823,207,242,4,39.0,M,marketing,92037,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0
879141989,195,242,4,42.0,M,scientist,93555,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0
891916883,500,242,3,28.0,M,administrator,94305,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0
885168819,520,242,5,62.0,M,healthcare,12603,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0
888817735,532,242,4,,M,student,92705,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0
884698095,533,242,4,43.0,M,librarian,02324,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0
875997093,594,242,4,46.0,M,educator,M4J2K,Kolya (1996),1997-01-24,http://us.imdb.com/M/title-exact?Kolya%20(1996),...,0,0,0,0,0,0,0,0,0,0
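
Note that `select` was deprecated in pandas 0.21 and removed in pandas 1.0, so the cell above runs only on older versions. A possible equivalent, sketched here on a small made-up frame with hypothetical timestamps, is to build a boolean mask from the index and pass it to `.loc`:

```python
import pandas as pd

# Hypothetical stand-in for a timestamp-indexed frame.
toy = pd.DataFrame(
    {"rating": [3, 4, 5, 2]},
    index=[877756647, 885844496, 889751633, 890793824],
)

# Boolean mask built from the index: keep rows whose label is odd.
odd_rows = toy.loc[toy.index % 2 == 1]
print(odd_rows)
```

The same pattern should apply to `movies_user_gender_male`, assuming its index holds integer timestamps as in the output above.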


In the same way, you can easily filter a DataFrame by its index values:

In [42]:
movies_user_gender_male[movies_user_gender_male.index > 893280000]

timestamp,user_id,movie_id,rating,age,gender,occupation,zip_code,movie_title,release_date,IMDb_URL,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
893286584,683,588,4,42.0,M,librarian,23509,Beauty and the Beast (1991),1991-01-01,http://us.imdb.com/M/title-exact?Beauty%20and%...,...,0,0,0,1,0,0,0,0,0,0
893286259,683,288,3,42.0,M,librarian,23509,Scream (1996),1996-12-20,http://us.imdb.com/M/title-exact?Scream%20(1996),...,0,0,1,0,0,0,0,1,0,0
893286373,729,338,1,19.0,M,student,56567,Bean (1997),1997-01-01,http://us.imdb.com/M/title-exact?Bean+(1997),...,0,0,0,0,0,0,0,0,0,0
893286364,683,56,5,42.0,M,librarian,23509,Pulp Fiction (1994),1994-01-01,http://us.imdb.com/M/title-exact?Pulp%20Fictio...,...,0,0,0,0,0,0,0,0,0,0
893283641,683,880,3,42.0,M,librarian,23509,Soul Food (1997),1997-01-01,http://us.imdb.com/M/title-exact?Soul+Food+(1997),...,0,0,0,0,0,0,0,0,0,0
893282978,683,258,3,42.0,M,librarian,23509,Contact (1997),1997-07-11,http://us.imdb.com/Title?Contact+(1997/I),...,0,0,0,0,0,0,1,0,0,0
893286204,729,310,3,19.0,M,student,56567,"Rainmaker, The (1997)",1997-01-01,"http://us.imdb.com/M/title-exact?Rainmaker,+Th...",...,0,0,0,0,0,0,0,0,0,0
893286501,683,317,4,,M,librarian,23509,In the Name of the Father (1993),1993-01-01,http://us.imdb.com/M/title-exact?In%20the%20Na...,...,0,0,0,0,0,0,0,0,0,0
893286502,683,609,3,42.0,M,librarian,23509,Father of the Bride (1950),1950-01-01,http://us.imdb.com/M/title-exact?Father%20of%2...,...,0,0,0,0,0,0,0,0,0,0
893286168,729,346,1,19.0,M,student,56567,Jackie Brown (1997),1997-01-01,http://us.imdb.com/M/title-exact?imdb-title-11...,...,0,0,0,0,0,0,0,0,0,0


In [43]:
filtered_df_2_multi[(filtered_df_2_multi.index.get_level_values(0) > 900) &
                    (filtered_df_2_multi.index.get_level_values(1) > 1000)]

user_id,movie_id,rating,age,gender,occupation,movie_title
928,1007,5,21.0,M,student,Waiting for Guffman (1996)
937,1007,4,48.0,M,educator,Waiting for Guffman (1996)
936,1007,5,,M,other,Waiting for Guffman (1996)
923,1277,5,,M,student,Set It Off (1996)
939,1277,5,26.0,F,student,Set It Off (1996)
907,1016,5,25.0,F,other,Con Air (1997)
902,1016,2,45.0,F,artist,Con Air (1997)
927,1016,5,23.0,M,programmer,Con Air (1997)
935,1016,4,42.0,M,doctor,Con Air (1997)
938,1016,3,38.0,F,technician,Con Air (1997)


### Work with missing data

[[back to top]](#Table-of-Contents)

Pandas primarily uses the value `np.nan` to represent missing data (in tables, missing or empty values are shown as `NaN`). By default, such values are not included in pandas computations. Missing data causes many problems in mathematical and computational tasks with DataFrames and Series, so it is important to know how to deal with these values.

Previously we learned how to check for `null` and `non-null` values in a DataFrame or Series and how to skip rows with `null` values in a table. But what should we do if we need to use rows with `null` data, for example, to find the sum of all values in the dataset?

Let’s try to do this:


In [44]:
ages = movies['age']
sum(ages)

nan

The result is unexpected, because there are many `non-null` values in the `movies['age']` Series. It is `nan` because Python’s built-in `sum` adds the values one by one, and any arithmetic involving `NaN` yields `NaN`. Sure, we could filter `movies['age']` and keep only the `non-null` values. But what if we need to sum all numerical values in `movies`? That approach becomes ineffective or too complicated, because we would drop an entire row even if it contains only one `null` value. You can try to do this yourself.

To solve this task you can use the pandas method `fillna(value)`, which replaces all `null` values with `value`.


In [45]:
ages = movies['age'].fillna(0)
sum(ages)

3089983.0
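
For comparison, the pandas reductions themselves skip `NaN` by default (`skipna=True`), so filling is not strictly required for a sum. The sketch below uses three made-up values, not the real `age` column, and also shows one side effect of filling with `0`: the sum is unchanged, but the mean is not.

```python
import numpy as np
import pandas as pd

ages = pd.Series([49.0, np.nan, 23.0])   # hypothetical values

builtin_total = sum(ages)     # nan: plain Python addition propagates NaN
pandas_total = ages.sum()     # 72.0: NaN is skipped (skipna=True by default)

mean_skipping = ages.mean()              # 36.0, computed over the 2 known values
mean_after_fill = ages.fillna(0).mean()  # 24.0, the filled zero counts as a value
```

Whether `0` is a sensible replacement therefore depends on the statistic you plan to compute afterwards.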

In [46]:
movies_fillna = movies.fillna(0)
movies_fillna.head(10)

,user_id,movie_id,rating,timestamp,age,gender,occupation,zip_code,movie_title,release_date,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,196,242,3,881250949,49.0,M,writer,55105,Kolya (1996),1997-01-24 00:00:00,...,0,0,0,0,0,0,0,0,0,0
1,305,242,5,886307828,23.0,M,programmer,94086,Kolya (1996),1997-01-24 00:00:00,...,0,0,0,0,0,0,0,0,0,0
2,6,242,4,883268170,42.0,M,executive,98101,Kolya (1996),1997-01-24 00:00:00,...,0,0,0,0,0,0,0,0,0,0
3,234,242,4,891033261,60.0,M,retired,94702,Kolya (1996),1997-01-24 00:00:00,...,0,0,0,0,0,0,0,0,0,0
4,63,242,3,875747190,31.0,M,marketing,75240,Kolya (1996),1997-01-24 00:00:00,...,0,0,0,0,0,0,0,0,0,0
5,181,242,1,878961814,26.0,M,executive,21218,Kolya (1996),1997-01-24 00:00:00,...,0,0,0,0,0,0,0,0,0,0
6,201,242,4,884110598,27.0,M,writer,E2A4H,Kolya (1996),1997-01-24 00:00:00,...,0,0,0,0,0,0,0,0,0,0
7,249,242,5,879571438,25.0,M,student,84103,Kolya (1996),1997-01-24 00:00:00,...,0,0,0,0,0,0,0,0,0,0
8,13,242,2,881515193,47.0,M,educator,29206,Kolya (1996),1997-01-24 00:00:00,...,0,0,0,0,0,0,0,0,0,0
9,279,242,3,877756647,33.0,M,programmer,85251,Kolya (1996),1997-01-24 00:00:00,...,0,0,0,0,0,0,0,0,0,0


Thus, we have replaced all `NaN` items with `0`. If you pass `inplace=True` to the `fillna()` method, the DataFrame is modified in place instead of a new one being returned.
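
A single value such as `0` is not always meaningful for every column (for example, for a text column like `occupation`). As a minimal sketch on made-up data, `fillna` also accepts a dict that maps column names to their own replacement values; the `"unknown"` placeholder here is an assumption chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame with one missing value per column.
toy = pd.DataFrame({"age": [49.0, np.nan], "occupation": ["writer", np.nan]})

# A dict maps column names to their own replacement values.
filled = toy.fillna({"age": 0, "occupation": "unknown"})

# fillna returns a new object; `toy` itself still contains NaN here.
print(filled)
```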
   
To keep only the rows with `non-null` values, use the `dropna()` method:

In [47]:
# axis=0 drops rows; recent pandas versions require `axis` as a keyword argument
movies_dropna = movies.dropna(axis=0)
movies_dropna.head(10)

,user_id,movie_id,rating,timestamp,age,gender,occupation,zip_code,movie_title,release_date,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,196,242,3,881250949,49.0,M,writer,55105,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
1,305,242,5,886307828,23.0,M,programmer,94086,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
2,6,242,4,883268170,42.0,M,executive,98101,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
3,234,242,4,891033261,60.0,M,retired,94702,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
4,63,242,3,875747190,31.0,M,marketing,75240,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
5,181,242,1,878961814,26.0,M,executive,21218,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
6,201,242,4,884110598,27.0,M,writer,E2A4H,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
7,249,242,5,879571438,25.0,M,student,84103,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
8,13,242,2,881515193,47.0,M,educator,29206,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
9,279,242,3,877756647,33.0,M,programmer,85251,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0


We can control how `null` values are handled with the `subset` and `how` parameters, which set the columns to analyze and the type of analysis, respectively:

In [48]:
# drop rows where the 'age' or the 'occupation' value is NaN
filtered_df_2_dropna_1 = filtered_df_2.dropna(how='any', subset=['age', 'occupation'])
filtered_df_2_dropna_1.head(10)

,user_id,movie_id,rating,age,gender,occupation,movie_title
0,196,242,3,49.0,M,writer,Kolya (1996)
1,305,242,5,23.0,M,programmer,Kolya (1996)
2,6,242,4,42.0,M,executive,Kolya (1996)
4,63,242,3,31.0,M,marketing,Kolya (1996)
7,249,242,5,25.0,M,student,Kolya (1996)
8,13,242,2,47.0,M,educator,Kolya (1996)
9,279,242,3,33.0,M,programmer,Kolya (1996)
10,145,242,5,31.0,M,entertainment,Kolya (1996)
13,18,242,5,35.0,F,other,Kolya (1996)
15,207,242,4,39.0,M,marketing,Kolya (1996)


In [49]:
# drop rows where both the 'age' and the 'occupation' values are NaN
filtered_df_2_dropna_2 = filtered_df_2.dropna(how='all', subset=['age', 'occupation'])
filtered_df_2_dropna_2.head(10)

,user_id,movie_id,rating,age,gender,occupation,movie_title
0,196,242,3,49.0,M,writer,Kolya (1996)
1,305,242,5,23.0,M,programmer,Kolya (1996)
2,6,242,4,42.0,M,executive,Kolya (1996)
4,63,242,3,31.0,M,marketing,Kolya (1996)
7,249,242,5,25.0,M,student,Kolya (1996)
8,13,242,2,47.0,M,educator,Kolya (1996)
9,279,242,3,33.0,M,programmer,Kolya (1996)
10,145,242,5,31.0,M,entertainment,Kolya (1996)
13,18,242,5,35.0,F,other,Kolya (1996)
15,207,242,4,39.0,M,marketing,Kolya (1996)


Thus, if `how='all'`, a row is dropped only when the values in all columns from `subset` are `NaN`, and if `how='any'`, a row is dropped when at least one of these columns contains `NaN`.
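
The first ten rows of the two outputs above look identical, so the difference is easier to see on a tiny example. The following sketch uses a made-up three-row frame (hypothetical values) in which one row misses only `age` and another misses both columns:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age": [49.0, np.nan, np.nan],
    "occupation": ["writer", "student", np.nan],
})

# how='any': a row is dropped if at least one subset column is NaN.
any_dropped = toy.dropna(how="any", subset=["age", "occupation"])

# how='all': a row is dropped only if every subset column is NaN.
all_dropped = toy.dropna(how="all", subset=["age", "occupation"])

print(list(any_dropped.index), list(all_dropped.index))
```

Here `how='any'` keeps only row 0, while `how='all'` keeps rows 0 and 1.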

> ### Exercise 2.1

> - Get the type of the `age` column in `movies`.

> - In `movies`, find all rows where `release_date` falls in the year `1995`, `age` is less than `25`, the gender is female and `occupation` is not null. Name the resulting DataFrame `df_1995`. The [`datetime` module](https://docs.python.org/2/library/datetime.html) may be helpful here.

> - Create a new DataFrame from `movies` by indexing it with the levels `movie_id`, `rating` and `gender`. Then select the rows with `movie_id=242`, `rating=5` and `gender='M'` and write the result to the `indexed` variable.

> - In the DataFrame created in the previous step, select all rows with rating values between 2 and 4 and `movie_id` larger than 1000. Keep only non-null values, i.e. those rows in which no record is `NaN`. Count the records and write the result to the `amount` variable.

In [51]:
# type your code here
print(movies.age.dtypes)
# To check the correctness of your answers we are using the "data/movies.csv".
# So, if you have changed the `movies` DataFrame in some way, please read "data/movies.csv" again before continuing.
movies = pd.read_csv('data/movies.csv', encoding="ISO-8859-1")
movies['release_date'] = pd.to_datetime(movies['release_date'])
print(movies.release_date.dtypes)

df_1995 = movies[(movies['release_date'].dt.year == 1995) & (movies['age'] < 25.0) & (movies['gender'] == 'F')].dropna()
indexed_prep = movies[['movie_id', 'rating', 'gender']]
indexed = indexed_prep[(indexed_prep['movie_id'] == 242) & (indexed_prep['rating'] == 5) & (indexed_prep['gender'] == 'M')].dropna()
print(indexed.head(5))
# build the mask from `indexed` itself so that it aligns with the rows being filtered
new_indexed = indexed[(indexed['rating'] > 2) & (indexed['rating'] < 4) & (indexed['movie_id'] > 1000)].dropna()
print(new_indexed.head(5))
amount = len(new_indexed)

float64
datetime64[ns]
    movie_id  rating gender
1        242       5      M
7        242       5      M
10       242       5      M
14       242       5      M
25       242       5      M
Empty DataFrame
Columns: [movie_id, rating, gender]
Index: []



In [82]:
from test_helper import Test

Test.assertEqualsHashed(df_1995, '12cc22430c99dd2da1588b582886552b6b5e16bb', 
                                 'Incorrect content of "df_1995"', "Exercise 2.1.1 is successful")
Test.assertEqualsHashed(indexed, 'a08b60ec26ae293cc3230e007e7d377848797426', 
                                 'Incorrect content of "indexed"', "Exercise 2.1.2 is successful")
Test.assertEqualsHashed(amount, 'bc1a73ffba838f9263e05db6eefe1bf5d7cf636e', 
                                'Incorrect value of "amount"', "Exercise 2.1.3 is successful")

1 test passed. Exercise 2.1.1 is successful
1 test failed. Incorrect content of "indexed"
1 test failed. Incorrect value of "amount"


<center><h3>Presented by <a target="_blank" rel="noopener noreferrer nofollow" href="http://datascience-school.com">datascience-school.com</a></h3></center>