---
<center><h1>Basic intro into pandas</h1></center> 

<center><h2>Work with pandas DataFrames: main operations, sorting and selecting by type</h2></center>

---

## Table of Contents
- [Work with pandas DataFrames: main operations, sorting and selecting by type](#Work-with-pandas-DataFrames:-main-operations,-sorting-and-selecting-by-type)
    * [Flexible comparisons and boolean reductions](#Flexible-comparisons-and-boolean-reductions)
    * [Descriptive statistics](#Descriptive-statistics)
    * [Function application](#Function-application)
    * [Sorting](#Sorting)
    * [Selecting by type](#Selecting-by-type)
    - [*Exercise 1*](#Exercise-1)

In [225]:
import pandas as pd
import numpy as np
import random

## Work with pandas DataFrames: main operations, sorting and selecting by type

[[back to top]](#Table-of-Contents)

In this part we will consider the following questions:
1.	how quickly compare two or more DataFrames or check if Dataframe’s items satisfy any condition.
2.	what main mathematical (computational) and statistical operations may be easily applied to pandas DataFrame's data, i.e. what such operations are build in pandas; 
3.	how to apply an arbitrary function to DataFrame’s items, rows, columns and whole DataFrame and change its data type;
4.	how sort rows and columns data;
5.	how select any column by its type.

At first, let’s find all unique dates in `‘release_date’` column of `movies` and then select only `dates` in range lower `1995`.

In [227]:
movies = pd.read_csv('data/movies.csv', encoding="ISO-8859-1")
movies['release_date'] = movies['release_date'].map(pd.to_datetime)

In [228]:
# get unique values
unique_dates = movies['release_date'].drop_duplicates().dropna()
unique_dates

0       1997-01-24
117     1993-01-01
309     1994-01-01
409     1997-07-11
455     1986-01-01
           ...    
99938   1986-04-26
99940   1998-03-06
99958   1996-09-18
99967   1996-02-28
99977   1997-04-30
Name: release_date, Length: 240, dtype: datetime64[ns]

In [229]:
# find dates with year lower/equal than 1995
unique_dates_1 = list(filter(lambda x: x.year <= 1995, unique_dates))
unique_dates_1

[Timestamp('1993-01-01 00:00:00'), Timestamp('1994-01-01 00:00:00'), Timestamp('1986-01-01 00:00:00'), Timestamp('1987-01-01 00:00:00'), Timestamp('1979-01-01 00:00:00'), Timestamp('1995-01-01 00:00:00'), Timestamp('1990-01-01 00:00:00'), Timestamp('1971-01-01 00:00:00'), Timestamp('1978-01-01 00:00:00'), Timestamp('1988-01-01 00:00:00'), Timestamp('1995-10-30 00:00:00'), Timestamp('1991-01-01 00:00:00'), Timestamp('1992-01-01 00:00:00'), Timestamp('1995-08-14 00:00:00'), Timestamp('1966-01-01 00:00:00'), Timestamp('1954-01-01 00:00:00'), Timestamp('1962-01-01 00:00:00'), Timestamp('1989-01-01 00:00:00'), Timestamp('1980-01-01 00:00:00'), Timestamp('1969-01-01 00:00:00'), Timestamp('1952-01-01 00:00:00'), Timestamp('1974-01-01 00:00:00'), Timestamp('1973-01-01 00:00:00'), Timestamp('1984-01-01 00:00:00'), Timestamp('1985-01-01 00:00:00'), Timestamp('1970-01-01 00:00:00'), Timestamp('1981-01-01 00:00:00'), Timestamp('1967-01-01 00:00:00'), Timestamp('1933-01-01 00:00:00'), Timestamp('19

Here we have used `drop_duplicates()` method to select only `unique` Series values. Then we can filter `movies` with respect to `release_date` condition. Each `datetime` Python object possesses with attributes `year`, `month`, `day`, etc. allowing to extract values of year, month, day, etc. from the date. We call the new DataFrame as `old_movies`.

In [230]:
old_movies = movies[movies['release_date'].isin(unique_dates_1)]
old_movies.head(10)

Unnamed: 0,user_id,movie_id,rating,timestamp,age,gender,occupation,zip_code,movie_title,release_date,IMDb_URL,unknown,Action,Adventure,Animation,Childrens,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
117,196,393,4,881251863,49.0,M,writer,55105,Mrs. Doubtfire (1993),1993-01-01,http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
118,22,393,4,878886989,25.0,M,writer,40206,Mrs. Doubtfire (1993),1993-01-01,http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
119,244,393,3,880607365,28.0,M,technician,80525,Mrs. Doubtfire (1993),1993-01-01,http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
120,298,393,4,884183099,44.0,M,executive,1581,Mrs. Doubtfire (1993),1993-01-01,http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
121,286,393,4,877534481,27.0,M,student,15217,Mrs. Doubtfire (1993),1993-01-01,http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
122,200,393,4,884129410,40.0,M,programmer,93402,Mrs. Doubtfire (1993),1993-01-01,http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
123,210,393,3,891035904,39.0,M,engineer,3060,Mrs. Doubtfire (1993),1993-01-01,http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
124,303,393,4,879484981,19.0,M,student,14853,Mrs. Doubtfire (1993),1993-01-01,http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
125,194,393,2,879524007,,M,administrator,2154,Mrs. Doubtfire (1993),1993-01-01,http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
126,291,393,3,875086235,19.0,M,student,44106,Mrs. Doubtfire (1993),1993-01-01,http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0


Now we may filter DataFrame `old_movies` by `age` and `rating`. Lets’ drop `timestamp`, `zip_code`

In [231]:
# get all users with age less than 25 that rated old movies higher than 3
old_movies_watch = old_movies[(old_movies['age'] < 25) & (old_movies['rating'] > 3)] 
old_movies_watch = old_movies_watch.drop(['timestamp', 'zip_code'],axis=1)
old_movies_watch.head(10)

Unnamed: 0,user_id,movie_id,rating,age,gender,occupation,movie_title,release_date,IMDb_URL,unknown,Action,Adventure,Animation,Childrens,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
124,303,393,4,19.0,M,student,Mrs. Doubtfire (1993),1993-01-01,http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
135,276,393,4,21.0,M,student,Mrs. Doubtfire (1993),1993-01-01,http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
153,128,393,4,24.0,F,marketing,Mrs. Doubtfire (1993),1993-01-01,http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
162,130,393,5,20.0,M,none,Mrs. Doubtfire (1993),1993-01-01,http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
183,314,393,4,20.0,F,student,Mrs. Doubtfire (1993),1993-01-01,http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
190,363,393,4,20.0,M,student,Mrs. Doubtfire (1993),1993-01-01,http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
192,373,393,4,24.0,F,other,Mrs. Doubtfire (1993),1993-01-01,http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
202,405,393,4,22.0,F,healthcare,Mrs. Doubtfire (1993),1993-01-01,http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
205,416,393,4,20.0,F,student,Mrs. Doubtfire (1993),1993-01-01,http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
215,471,393,5,10.0,M,student,Mrs. Doubtfire (1993),1993-01-01,http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0


We will use the last DataFrame in the future also. So, let’s begin.

In [232]:
df_ABC = pd.DataFrame({'A': [1,2,3], 'B': [3,4,5], 'C': [-1,9,-4]})
df_ABC

Unnamed: 0,A,B,C
0,1,3,-1
1,2,4,9
2,3,5,-4


In [233]:
df_ACD = pd.DataFrame({'A': [0,4,9], 'C': [-1,-3,-2], 'D': [0,1,-2]})
df_ACD

Unnamed: 0,A,C,D
0,0,-1,0
1,4,-3,1
2,9,-2,-2


In [234]:
df_ABC.le(df_ACD)

Unnamed: 0,A,B,C,D
0,False,False,True,False
1,True,False,False,False
2,True,False,True,False


As was mentioned above pandas compare elements from the same row and column. 

You can also apply the reductions: `empty`, `any()`, `all()`, and `bool()` to provide a way to summarize a boolean result:

In [235]:
# here vertical direction for comparison is taking into account and we check all column’s items
(df_ACD < 0).all()

A    False
C     True
D    False
dtype: bool

In [236]:
# here horizontal direction for comparison is taking into account and we check all row’s items
(df_ACD < 0).all(axis=1)

0    False
1    False
2    False
dtype: bool

In [237]:
# here vertical direction for comparison is taking into 
# account and we check if just one column’s item satisfies the condition
(df_ACD < 0).any()

A    False
C     True
D     True
dtype: bool

In [238]:
# here we check if all DataFrame's items satisfy the condition
(df_ACD < 0).any().any()

True

In [242]:
# here we check if DataFrame is empty (no elements)
df_ACD.empty

False

Based on the provided above way you can determine the necessary columns with respect to any condition. It’s helpful when need to quickly check if a DataFrame or its some row or columns contain, for instance, all positive values but it does not matter exactly what the elements – it is the main difference between filtering and flexible comparisons.  Remember you can reverse a boolean condition by using the not keyword.

### Descriptive statistics

[[back to top]](#Table-of-Contents)

pandas provides a large number of methods for computing descriptive statistics and other related mathematical operations on Series and DataFrame. Most of these are aggregations but some of them produce an object of the same size. Most of these functions are collected in summary table of common functions:

|Function|Description|
|--|-------------------------------|
|abs|absolute value|
|count|number of non-null observations|
|sum|sum of values|
|mean|mean of values|
|mad|mean absolute deviation|
|median|arithmetic median of values|
|min|minimum value|
|max|maximum value|
|idxmin|position of minimum value|
|idxmax|position of maximum value|
|mode|mode|
|prod|product of values|
|std|unbiased standard deviation|
|var|unbiased variance|
|cumsum|cumulative sum (a sequence of partial sums of a given sequence)|

Let’s demonstrate how you can use these methods:

In [243]:
old_movies_watch['age'].sum()

170223.0

In [244]:
old_movies_watch['age'].mean()

20.791865152070354

In [245]:
# returns average value for each column  
old_movies_watch.mean()



user_id        476.556370
movie_id       349.991572
rating           4.408575
age             20.791865
unknown          0.000000
Action           0.271406
Adventure        0.148162
Animation        0.051057
Childrens        0.087944
Comedy           0.291193
Crime            0.096372
Documentary      0.006840
Drama            0.390741
Fantasy          0.009894
Film-Noir        0.017222
Horror           0.058507
Musical          0.059118
Mystery          0.032246
Romance          0.206303
Sci-Fi           0.146207
Thriller         0.206913
War              0.100647
Western          0.026139
dtype: float64

In [246]:
# average value for all DataFrame
old_movies_watch.mean().mean()

37.12849108608026

In [247]:
old_movies_watch['age'].max(), old_movies_watch['age'].idxmax()

(24.0, 153)

Remember that we have had the troubles when have tried to count sum of `'precipitation'` column of `movies` in the previous post and independently replaced `null` values. Using above methods we do not need to think about it. 

### Function application

[[back to top]](#Table-of-Contents)

pandas allows to apply your own or some library’s function to pandas objects (particularly, Series and DataFrame). If you need to apply any function to DataFrame row or column you may use the function apply. When you need to make something transformations with some column’s or row’s elements, then method `map` will be helpful (it works like pure Python function `map()`). But there is also possibility to apply some function to each DataFrame element (not to a column or a row) – method `applymap` comes to the aid in this case.

For instance, we could find the average value of each column of `old_movies_watch` DataFrame in such way

In [23]:
old_movies_watch.loc[:, (old_movies_watch.dtypes == np.int64) | (old_movies_watch.dtypes == np.float64)].apply(np.mean)

user_id        476.556370
movie_id       349.991572
rating           4.408575
age             20.791865
unknown          0.000000
Action           0.271406
Adventure        0.148162
Animation        0.051057
Childrens        0.087944
Comedy           0.291193
Crime            0.096372
Documentary      0.006840
Drama            0.390741
Fantasy          0.009894
Film-Noir        0.017222
Horror           0.058507
Musical          0.059118
Mystery          0.032246
Romance          0.206303
Sci-Fi           0.146207
Thriller         0.206913
War              0.100647
Western          0.026139
dtype: float64

or of each row (let’s remind the attribute axis define the horizontal `(axis=1)` or vertical direction for calculations `(axis=0)`)

In [24]:
old_movies_watch.loc[:, (old_movies_watch.dtypes == np.int64) | (old_movies_watch.dtypes == np.float64)]. \
                 apply(np.mean, axis=1).head(10)

124    31.304348
135    30.217391
153    23.913043
162    23.869565
183    31.826087
190    33.956522
192    34.565217
202    35.869565
205    36.260870
215    38.260870
dtype: float64

or find the absolute value of the difference between maximal and minimal values multiplied by elements amount in the corresponding row

In [249]:
old_movies_watch.loc[:, (old_movies_watch.dtypes == np.int64) | (old_movies_watch.dtypes == np.float64)].apply(lambda x: abs(x.max() - x.min())*x.count())

user_id         7712154.0
movie_id       13123761.0
rating             8187.0
age              139179.0
unknown               0.0
Action             8187.0
Adventure          8187.0
Animation          8187.0
Childrens          8187.0
Comedy             8187.0
Crime              8187.0
Documentary        8187.0
Drama              8187.0
Fantasy            8187.0
Film-Noir          8187.0
Horror             8187.0
Musical            8187.0
Mystery            8187.0
Romance            8187.0
Sci-Fi             8187.0
Thriller           8187.0
War                8187.0
Western            8187.0
dtype: float64

You can also apply any your own function set before using method `apply`

In [251]:
def my_own_func(x, power, delta=0):
    if x < 20:
        return (x - delta)**power
    elif x >= 20:
        return round(power/x, 2)
    else:
        return  np.nan
    
old_movies_watch['age'].apply(my_own_func, args=(2,), delta=1).head(10)

124    324.00
135      0.10
153      0.08
162      0.10
183      0.10
190      0.10
192      0.08
202      0.09
205      0.10
215     81.00
Name: age, dtype: float64

where the first argument of `apply` method is the function name, the second are `tuple` of all variables without default values, the follow all variables with default values.

To apply any function to each Series element (row or column of a DataFrame) you may use method `map` (please see the type of `'age'` column before; do you remember how it can be done?)

In [252]:
# get 'age' column where NaN replaced by 0
old_movies_watch['age'].map(lambda x: int(x) if pd.notnull(x) else 0).head(10)

124    19
135    21
153    24
162    20
183    20
190    20
192    24
202    22
205    20
215    10
Name: age, dtype: int64

The same result can be obtained using pure Python

In [253]:
list(map(lambda x: int(x) if pd.notnull(x) else 0,old_movies_watch['age']))[:10]

[19, 21, 24, 20, 20, 20, 24, 22, 20, 10]

but now we have deal  with list without Series.

And just one recipe to get the same result:

In [254]:
old_movies_watch['age'].fillna(0).astype(int).head(10)

124    19
135    21
153    24
162    20
183    20
190    20
192    24
202    22
205    20
215    10
Name: age, dtype: int32

Here we have used method `astype()` to change type of column’s elements. But why we have written `fillna(0)`?

### Sorting

[[back to top]](#Table-of-Contents)

pandas functionality proposes two kinds of very fast sorting: sorting by label using `sort_index()` and sorting by actual values `order()` for Series and `sort()` for DataFrame. Let’s note that both sorting procedures don’t return a new object by default, except by passing attribute `inplace=True`. For applying of `sort()` method to a DataFrame you should set an arbitrary vector or a column name of the DataFrame to determine the sort order. Otherwise `sort()` works as well as `sort_index()`. By default pandas return an object in ascending order. For changing it to descending order you should set attribute `ascending=False`.


In [257]:
old_movies_watch.sort_index().head(10)

Unnamed: 0,user_id,movie_id,rating,age,gender,occupation,movie_title,release_date,IMDb_URL,unknown,Action,Adventure,Animation,Childrens,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
124,303,393,4,19.0,M,student,Mrs. Doubtfire (1993),1993-01-01,http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
135,276,393,4,21.0,M,student,Mrs. Doubtfire (1993),1993-01-01,http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
153,128,393,4,24.0,F,marketing,Mrs. Doubtfire (1993),1993-01-01,http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
162,130,393,5,20.0,M,none,Mrs. Doubtfire (1993),1993-01-01,http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
183,314,393,4,20.0,F,student,Mrs. Doubtfire (1993),1993-01-01,http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
190,363,393,4,20.0,M,student,Mrs. Doubtfire (1993),1993-01-01,http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
192,373,393,4,24.0,F,other,Mrs. Doubtfire (1993),1993-01-01,http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
202,405,393,4,22.0,F,healthcare,Mrs. Doubtfire (1993),1993-01-01,http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
205,416,393,4,20.0,F,student,Mrs. Doubtfire (1993),1993-01-01,http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
215,471,393,5,10.0,M,student,Mrs. Doubtfire (1993),1993-01-01,http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0


In [256]:
old_movies_watch.sort_index(axis=1).sort_index(ascending=False).head(10)

Unnamed: 0,Action,Adventure,Animation,Childrens,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMDb_URL,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,age,gender,movie_id,movie_title,occupation,rating,release_date,unknown,user_id
99860,0,0,0,0,0,1,0,0,0,1,0,http://us.imdb.com/M/title-exact?He%20Walked%2...,0,0,0,0,1,0,0,24.0,M,1604,He Walked by Night (1948),technician,4,1948-01-01,0,456
99825,0,0,0,0,1,0,0,0,0,0,0,http://us.imdb.com/M/title-exact?It%20Takes%20...,0,0,0,0,0,0,0,21.0,F,1544,It Takes Two (1995),student,4,1995-01-01,0,705
99769,0,0,0,0,0,0,1,0,0,0,0,"http://us.imdb.com/M/title-exact?Show,%20The%2...",0,0,0,0,0,0,0,24.0,M,1547,"Show, The (1995)",technician,4,1995-01-01,0,456
99749,0,1,0,1,0,0,0,0,0,0,0,http://us.imdb.com/M/title-exact?Secret%20Adve...,0,0,0,0,0,0,0,20.0,M,1555,"Secret Adventures of Tom Thumb, The (1993)",student,4,1993-01-01,0,773
99742,0,0,0,0,0,0,0,1,0,0,0,http://us.imdb.com/M/title-exact?Safe%20Passag...,0,0,0,0,0,0,0,22.0,F,1554,Safe Passage (1994),healthcare,4,1994-01-01,0,405
99736,0,1,0,1,0,0,0,0,0,0,0,http://us.imdb.com/M/title-exact?Amazing%20Pan...,0,0,0,0,0,0,0,14.0,F,1540,"Amazing Panda Adventure, The (1995)",student,5,1995-01-01,0,887
99693,0,1,0,1,0,0,0,0,0,0,0,http://us.imdb.com/M/title-exact?Far%20From%20...,0,0,0,0,0,0,0,19.0,M,1531,Far From Home: The Adventures of Yellow Dog (1...,student,4,1995-01-01,0,393
99644,0,0,0,0,0,1,0,1,0,0,0,http://us.imdb.com/M/title-exact?New%20Jersey%...,0,0,0,0,0,0,0,20.0,F,1519,New Jersey Drive (1995),student,4,1995-01-01,0,314
99626,0,0,0,0,0,0,0,1,0,0,0,http://us.imdb.com/M/title-exact?Losing%20Isai...,0,0,0,0,0,0,0,20.0,F,1518,Losing Isaiah (1995),student,4,1995-01-01,0,314
99566,0,0,0,0,0,0,0,1,0,0,0,http://us.imdb.com/M/title-exact?Wedding%20Gif...,0,0,0,0,0,0,0,20.0,F,1516,"Wedding Gift, The (1994)",student,5,1994-01-01,0,416


In [36]:
#sort by 'user_id' and 'movie_id' with 'movie_id' ascending with NaNs in beginning
old_movies_watch.sort_values(['user_id', 'movie_id'], ascending=[0, 1], na_position='first').head(10)

Unnamed: 0,user_id,movie_id,rating,age,gender,occupation,movie_title,release_date,IMDb_URL,unknown,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
30852,943,2,5,22.0,M,student,GoldenEye (1995),1995-01-01,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,...,0,0,0,0,0,0,0,1,0,0
64600,943,11,4,22.0,M,student,Seven (Se7en) (1995),1995-01-01,http://us.imdb.com/M/title-exact?Se7en%20(1995),0,...,0,0,0,0,0,0,0,1,0,0
18728,943,12,5,22.0,M,student,"Usual Suspects, The (1995)",1995-08-14,http://us.imdb.com/M/title-exact?Usual%20Suspe...,0,...,0,0,0,0,0,0,0,1,0,0
91813,943,27,4,22.0,M,student,Bad Boys (1995),1995-01-01,http://us.imdb.com/M/title-exact?Bad%20Boys%20...,0,...,0,0,0,0,0,0,0,0,0,0
53198,943,28,4,22.0,M,,Apollo 13 (1995),1995-01-01,http://us.imdb.com/M/title-exact?Apollo%2013%2...,0,...,0,0,0,0,0,0,0,1,0,0
11456,943,31,4,22.0,M,student,Crimson Tide (1995),1995-01-01,http://us.imdb.com/M/title-exact?Crimson%20Tid...,0,...,0,0,0,0,0,0,0,1,1,0
82380,943,41,4,22.0,M,student,Billy Madison (1995),1995-01-01,http://us.imdb.com/M/title-exact?Billy%20Madis...,0,...,0,0,0,0,0,0,0,0,0,0
43465,943,42,5,22.0,M,student,Clerks (1994),1994-01-01,http://us.imdb.com/M/title-exact?Clerks%20(1994),0,...,0,0,0,0,0,0,0,0,0,0
33886,943,50,4,22.0,M,student,Star Wars (1977),1977-01-01,http://us.imdb.com/M/title-exact?Star%20Wars%2...,0,...,0,0,0,0,0,1,1,0,1,0
38483,943,54,4,22.0,M,student,Outbreak (1995),1995-01-01,http://us.imdb.com/M/title-exact?Outbreak%20(1...,0,...,0,0,0,0,0,0,0,1,0,0


Here the first argument represent `list` of DataFrame’s columns, the seconds one denotes sorting order for corresponding column and the last one defines the position where null values will be placed. 

And let’s give an example of Series sorting:

In [38]:
old_movies_watch['occupation'].sort_values()

59774    administrator
5592     administrator
93968    administrator
21233    administrator
45047    administrator
21228    administrator
95121    administrator
74516    administrator
93886    administrator
20891    administrator
56433    administrator
27496    administrator
45915    administrator
44165    administrator
64646    administrator
54214    administrator
33627    administrator
18781    administrator
91058    administrator
53720    administrator
87757    administrator
81095    administrator
47270    administrator
28115    administrator
47545    administrator
47554    administrator
28363    administrator
75495    administrator
81627    administrator
21615    administrator
             ...      
91303              NaN
91773              NaN
91878              NaN
92069              NaN
92114              NaN
92128              NaN
92459              NaN
92847              NaN
93004              NaN
93039              NaN
93325              NaN
93806              NaN
93910      

Let’s note that previous pandas versions (before 0.17.0) contain other method for sorting by values: `sort_values(inplace=True)` for Series and `sort_values(by=[“column’s name”])` for DataFrame.

It is important to note that Series has the `nsmallest()` and `nlargest()` methods which return the smallest or largest `n` values. For a large Series this can be much faster than sorting the entire Series and calling `head(n)` on the result.


In [39]:
old_movies_watch['age'].nlargest(3)

153    24.0
192    24.0
268    24.0
Name: age, dtype: float64

In [40]:
old_movies_watch['age'].nsmallest(5)

22667    7.0
24343    7.0
30925    7.0
34339    7.0
36820    7.0
Name: age, dtype: float64

### Selecting by type

[[back to top]](#Table-of-Contents)

You already know how to see types of each column of a DataFrame (with the help of `dtypes`, for example) and how to change type of any DataFrames’s column or row (by using `astype()` method). But what to do if you need to select a specific column of a certain type? Method `select_dtypes()` makes this issue very easy. Let’s create a DataFrame with data of many different types to demonstrate its work


In [41]:
import datetime
types_df = pd.DataFrame({  'int': list(range(3)),
                           'float': [1.1, 2.2, 3.3],
                           'bool': [False, True, False],
                           'string': list('abc'),
                           'undefined': [2>1, pd.isnull(np.inf),isinstance([],list)],
                           'shuffled': [datetime.datetime.now(), [np.nan, np.inf], type('A')],
                           'date': pd.date_range('20151120', periods=3).values
                        })
types_df

Unnamed: 0,int,float,bool,string,undefined,shuffled,date
0,0,1.1,False,a,True,2019-04-23 14:36:18.177086,2015-11-20
1,1,2.2,True,b,False,"[nan, inf]",2015-11-21
2,2,3.3,False,c,True,<class 'str'>,2015-11-22


In [42]:
types_df.dtypes

int                   int64
float               float64
bool                   bool
string               object
undefined              bool
shuffled             object
date         datetime64[ns]
dtype: object

Pay attention that pandas defines Python type str as type object. 

Let’s select only boolean columns


In [43]:
types_df.select_dtypes(include=['bool'])   
# or types_df.select_dtypes(include=[bool])

Unnamed: 0,bool,undefined
0,False,True
1,True,False
2,False,True


or remain all columns which are have no bool or object types

In [44]:
types_df.select_dtypes(exclude=['bool', 'object']) 
# or types_df.select_dtypes(include=['datetime64[ns]','float64', 'int64'])


Unnamed: 0,int,float,date
0,0,1.1,2015-11-20
1,1,2.2,2015-11-21
2,2,3.3,2015-11-22


> ### Exercise 3.1

> - In the `old_movies_watch` DataFrame find all columns with type `'object'`.

> - Write the function which will extract the released year from `movie_title` and convert this value to `float` type if `gender = "M"` or return how much years leave to 2000 otherwise. If the movie title does not contain a year, then the function should return zero. Apply this function to `movies` DataFrame, calculate the average value of this new Series and write result to `avg` variable.

> - Sort in place the `old_movies_watch` DataFrame’s rows in ascending order by `release_date`. If there are many rows with the same `release_date` value, they should be sorted by the `user_id` in descending order. `Null` values should be placed in the bottom part of the table.

In [47]:
# To check the correctness of your answers we are using the "data/movies.csv".
# So, if use have changed the `movies` DataFrame in some way, please read "data/movies.csv" again before continuing.
movies = pd.read_csv('data/movies.csv', encoding="ISO-8859-1")
movies['release_date'] = movies['release_date'].map(pd.to_datetime)
object_old = old_movies_watch.select_dtypes(include=['bool'])
# print movies.dtypes
# type your code here
def my_own_func(x, power, delta=0):
    if x < 20:
        return (x - delta)**power
    elif x >= 20:
        return round(power/x, 2)
    else:
        return  np.nan
    
#old_movies_watch['age'].apply(my_own_func, args=(2,), delta=1).head(10)
#unique_dates_1 = filter(lambda x: x.year <= 1995, unique_dates)
#old_movies = movies[movies['release_date'].isin(unique_dates_1)]
list_gender = movies['gender'].tolist()
list_titles = movies['movie_title'].tolist()
print (list_titles[:5])
get_years = [i.split("(")[-1].split(")")[0] for i in list_titles]
print (get_years[:5])
def prep(l1,l2,l3):
    list1 = []
    for i in range(len(l1)):
        if l2[i]=='M' and l3[i]!='unknown' and l3[i]!='V':
            k = float(l3[i])
            list1.append(k)
        elif l2[i]=='F' and l3[i]!='unknown' and l3[i]!='V':
            k = int(2000-int(l3[i]))
            list1.append(k)
        else:
            k = 0
            list1.append(k)
    return list1

list2 = prep(list_titles,list_gender,get_years)
print (list2.count(0))
print (sum(list2)/len(list2))

movies['movie_title'] = list2
ap = movies[['movie_title','gender']]
print (ap.head(100))
#print set(get_years) # 'unknown'
avg = round(int(sum(list2)/len(list2)),1)

old_movies_watch = old_movies_watch.sort_values(['release_date', 'user_id'], ascending=[0, 1], na_position='last')

['Kolya (1996)', 'Kolya (1996)', 'Kolya (1996)', 'Kolya (1996)', 'Kolya (1996)']
['1996', '1996', '1996', '1996', '1996']
15
1478.80073
    movie_title gender
0        1996.0      M
1        1996.0      M
2        1996.0      M
3        1996.0      M
4        1996.0      M
5        1996.0      M
6        1996.0      M
7        1996.0      M
8        1996.0      M
9        1996.0      M
10       1996.0      M
11       1996.0      M
12       1996.0      M
13          4.0      F
14       1996.0      M
15       1996.0      M
16       1996.0      M
17       1996.0      M
18          4.0      F
19          4.0      F
20       1996.0      M
21          4.0      F
22          4.0      F
23       1996.0      M
24       1996.0      M
25       1996.0      M
26       1996.0      M
27          4.0      F
28          4.0      F
29       1996.0      M
..          ...    ...
70       1996.0      M
71          4.0      F
72       1996.0      M
73       1996.0      M
74          4.0      F
75       1996

In [76]:
from test_helper import Test

Test.assertEqualsHashed(avg, '7220eca0103ccd072829a98905925e416db2e0e0', 
                             'Incorrect value of "avg"', "Exercise 3.1.1 is successful")
Test.assertEqualsHashed(old_movies_watch, '1cd867325a3481a99897cec7c8624aafa93b61b0', 
                                          'Incorrect content of "old_movies_watch"', "Exercise 3.1.2 is successful")

1 test failed. Incorrect value of "avg"
1 test passed. Exercise 3.1.2 is successful


<center><h3>Presented by <a target="_blank" rel="noopener noreferrer nofollow" href="http://datascience-school.com">datascience-school.com</a></h3></center>