# Introduction to Pandas

## Learning Goals 
The goal of the Business Analytics exercise is to **teach all steps necessary to solve a predictive data analytics task** using machine learning/neural networks. As the basis for any machine learning is data, in this exercise, we will look into how to load and work with tabular data. 

For this, we use a CSV (comma-separated values) file that contains information about books from [Goodreads](https://www.goodreads.com). In this exercise, we will clean this data and further parse it.  We will do some exploratory data analysis to answer questions about these books and popular genres. 

After this introductory exercise for the Python package Pandas, you will feel more comfortable:

- Loading and working with tabular data. 
- Getting a first overview of the data in a Pandas dataframe.

## Importing modules
All notebooks should begin with code that imports *packages*, collections of commonly used Python functions and classes. Below we import the Pandas package, a library for working with tabular data. Future exercises will require additional modules, which we'll import with the same syntax.

`import MODULE_NAME as MODULE_NICKNAME` 

In the following, we import the package named Pandas and give it the nickname ```pd```.

In [4]:
import pandas as pd

## Loading and Cleaning with Pandas 
Pandas is a Python package that allows you to work with dataframes. Dataframes are two-dimensional tables with labeled rows and columns and look just like Excel sheets. In fact, Pandas provides functionality to read Excel files as well. However, the files in our exercises come in the so-called CSV (comma-separated values) format. You can simply open them in a text editor and have a look.

A CSV file ```file.csv``` can be read into a variable ```df``` (denoting a Pandas dataframe) by passing the file name as a string:

```
df = pd.read_csv('file.csv')
```

### Exercise: Loading data

1. Download or check out the CSV file `goodreads.csv` from GitHub. If you are using Google Colab, upload it there: click on "Files" or "Dateien" on the left side and then click on "Upload". The file should then be available under "data/goodreads.csv", relative to the Jupyter notebook.
2. Load the file using the function ```pd.read_csv``` and store it into the variable ```df```.

In [7]:
#1.

#2.
df = pd.read_csv('data/goodreads.csv')

Here is a description of the columns (in order) present in this csv file:

|column|description|
|---|---|
|rating| the average rating on a 1-5 scale achieved by the book|
|review_count| the number of Goodreads users who reviewed this book|
|isbn| the ISBN code for the book|
|booktype| an internal Goodreads identifier for the book|
|author_url| the Goodreads (relative) URL for the author of the book|
|year| the year the book was published|
|genre_urls| a string with '\|'-separated relative URLs of Goodreads genre pages|
|dir| a directory identifier internal to the scraping code|
|rating_count| the number of ratings for this book (this is different from the number of reviews)|
|name| the name of the book|
 

Let us see what issues we find with the data and resolve them. For this, you can simply type in ```print(df)```.

In [10]:
print(df)

      4.40   136455  0439023483  good_reads:book  \
0     4.41  16648.0  0439358078  good_reads:book   
1     3.56  85746.0  0316015849  good_reads:book   
2     4.23  47906.0  0061120081  good_reads:book   
3     4.23  34772.0  0679783261  good_reads:book   
4     4.25  12363.0  0446675539  good_reads:book   
...    ...      ...         ...              ...   
5994  4.17   2226.0  0767913736  good_reads:book   
5995  3.99    775.0  1416909427  good_reads:book   
5996  3.78    540.0  1620612321  good_reads:book   
5997  3.91    281.0         NaN  good_reads:book   
5998  4.35     61.0  0786929081  good_reads:book   

     https://www.goodreads.com/author/show/153394.Suzanne_Collins    2008  \
0     https://www.goodreads.com/author/show/1077326....            2003.0   
1     https://www.goodreads.com/author/show/941441.S...            2005.0   
2     https://www.goodreads.com/author/show/1825.Har...            1960.0   
3     https://www.goodreads.com/author/show/1265.Jan...            

Oh dear. That does not seem quite right. We are missing the column names: because the file has no header line, Pandas used the first data row as column names. We need to add the names ourselves! But what are they?

Here is a list of them in order:

`['rating', 'review_count', 'isbn', 'booktype','author_url', 'year', 'genre_urls', 'dir','rating_count', 'name']`

### Exercise: Load data with column names

1. Use the list of column names to properly read in the CSV file (have a look at the documentation for pd.read_csv to see how this is done - https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)

In [13]:
#1.
df = pd.read_csv('data/goodreads.csv', names=['rating', 'review_count', 'isbn', 'booktype','author_url', 
                                              'year', 'genre_urls', 'dir','rating_count', 'name'])
print(df)

      rating  review_count        isbn         booktype  \
0       4.40      136455.0  0439023483  good_reads:book   
1       4.41       16648.0  0439358078  good_reads:book   
2       3.56       85746.0  0316015849  good_reads:book   
3       4.23       47906.0  0061120081  good_reads:book   
4       4.23       34772.0  0679783261  good_reads:book   
...      ...           ...         ...              ...   
5995    4.17        2226.0  0767913736  good_reads:book   
5996    3.99         775.0  1416909427  good_reads:book   
5997    3.78         540.0  1620612321  good_reads:book   
5998    3.91         281.0         NaN  good_reads:book   
5999    4.35          61.0  0786929081  good_reads:book   

                                             author_url    year  \
0     https://www.goodreads.com/author/show/153394.S...  2008.0   
1     https://www.goodreads.com/author/show/1077326....  2003.0   
2     https://www.goodreads.com/author/show/941441.S...  2005.0   
3     https://www.goodr
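To see why the first call went wrong, here is a minimal sketch with a tiny in-memory CSV. The two rows and the use of `io.StringIO` are assumptions for illustration and are not part of the exercise files. Without `names`, the first line is used as the header; with `names`, every line is treated as data. The optional `dtype` argument shown here is one way to keep the leading zeros of an ISBN by reading it as a string.

```python
import io
import pandas as pd

# Hypothetical two-line CSV without a header line (not the Goodreads file)
csv_text = "4.40,0439023483\n4.41,0439358078\n"

# Default behaviour: the first line is consumed as column names
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default.shape)

# With names=...: every line is treated as data
df_named = pd.read_csv(
    io.StringIO(csv_text),
    names=["rating", "isbn"],
    dtype={"isbn": str},  # keep the ISBN as text so leading zeros survive
)
print(df_named)
```

In the default case one row is "lost" to the header, which matches the 5999 versus 6000 rows seen in the two outputs above.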

## Getting a First Impression

What I like doing first is having a look at the shape of the dataframe, i.e., the number of observations (rows) and the number of variables (columns). You can use the following expressions

- ```df.shape```
- ```len(df)```
- ```len(df.columns)```

for that. This also tells you how memory-intensive it is to work with the dataframe. If there are millions of rows, you have to be much more careful to pick efficient functions when altering the dataframe.

Afterwards, you can have a look at the first and last rows of the dataframe. These can be obtained by 

- ```df.head()```
- ```df.tail()```
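As a small sketch of these calls, the following uses a made-up toy dataframe (the values are assumptions, not taken from `goodreads.csv`). Note that `shape` is an attribute and is written without parentheses, while `head` and `tail` are methods that accept the number of rows to show.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "rating": [4.40, 4.41, 3.56, 4.23],
    "year": [2008, 2003, 2005, 1960],
})

n_rows, n_cols = toy.shape          # attribute: a tuple (rows, columns)
print(n_rows, n_cols)
print(len(toy), len(toy.columns))   # the same two numbers, obtained separately

first_two = toy.head(2)             # first two rows
last_one = toy.tail(1)              # last row
print(first_two)
print(last_one)
```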

### Exercise: First impression of the dataframe
1. Get an impression about the shape of the dataframe

In [15]:
#1.
print(df.shape)
print(len(df))
print(len(df.columns))

(6000, 10)
6000
10


## Cleaning: Examining the dataframe - quick checks

First, we should have a look at the data types of each column. This usually already gives a good impression of what kind of values appear in a column. You can use ```df.dtypes``` for that.

### Exercise: Data types of columns
1. Have a look at the data types and discuss them. Do they make sense? We will later fix some of the data types.

- float is a floating point number (e.g., 1.342)
- object is used in Pandas for storing strings (e.g., "Stephen King")

In [17]:
#1.
print(df.dtypes)

rating          float64
review_count    float64
isbn             object
booktype         object
author_url       object
year            float64
genre_urls       object
dir              object
rating_count    float64
name             object
dtype: object
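One likely reason why count-like columns such as `year` show up as `float64` is that they contain missing values. The following sketch, using made-up series, illustrates this: a column of whole numbers is stored as integers, but as soon as one entry is NaN the whole column becomes floating point.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration only
complete = pd.Series([2008, 2003, 2005])
with_gap = pd.Series([2008, np.nan, 2005])

print(complete.dtype)   # an integer dtype
print(with_gap.dtype)   # a float dtype, because NaN is a floating point value
```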


## Data Selection

Frequently, you want to select a single column and perform some operation on that column. You can access a single column with ```df[column_name]```. Let's have a look.

In [19]:
df['name'].head(10)

0              The Hunger Games (The Hunger Games, #1)
1    Harry Potter and the Order of the Phoenix (Har...
2                              Twilight (Twilight, #1)
3                                To Kill a Mockingbird
4                                  Pride and Prejudice
5                                   Gone with the Wind
6    The Chronicles of Narnia (Chronicles of Narnia...
7                                      The Giving Tree
8                                          Animal Farm
9    The Hitchhiker's Guide to the Galaxy (Hitchhik...
Name: name, dtype: object

In [20]:
# select 2 columns
# Watch out: if you want multiple columns, you have to pass it as a list
df[['isbn', 'name']].head(7)

Unnamed: 0,isbn,name
0,439023483,"The Hunger Games (The Hunger Games, #1)"
1,439358078,Harry Potter and the Order of the Phoenix (Har...
2,316015849,"Twilight (Twilight, #1)"
3,61120081,To Kill a Mockingbird
4,679783261,Pride and Prejudice
5,446675539,Gone with the Wind
6,66238501,The Chronicles of Narnia (Chronicles of Narnia...


In [21]:
col_a = 'isbn'
col_b = 'name'

df[[col_a, col_b]].tail(7)

Unnamed: 0,isbn,name
5993,345515501.0,"The Silent Girl (Rizzoli & Isles, #9)"
5994,393062260.0,The Book of Psalms
5995,767913736.0,The River of Doubt
5996,1416909427.0,Shug
5997,1620612321.0,Flawed
5998,,أسعد امرأة في العالم
5999,786929081.0,Legacy of the Drow Collector's Edition (Legacy...
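The two selection styles above return different kinds of objects. As a sketch on a made-up toy dataframe (names and values are assumptions), selecting with a single column name gives a `Series`, while selecting with a list of names gives a `DataFrame`, even if the list contains only one name.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "isbn": ["0439023483", "0439358078"],
    "name": ["Book A", "Book B"],
    "year": [2008, 2003],
})

single = toy["name"]            # one column name -> Series
pair = toy[["isbn", "name"]]    # list of column names -> DataFrame
one_as_frame = toy[["name"]]    # list with one name -> still a DataFrame

print(type(single).__name__)
print(type(pair).__name__)
print(type(one_as_frame).__name__)
```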


## Indexing
Indexing means that you want to access one or multiple rows. There are two different ways to do that.

- First, you can use the **integer position of the rows** (starting with 0) using **`df.iloc`**. The ```n-th``` row is accessed by ```df.iloc[n-1]```. This is quite similar to indexing in lists, which we covered in the last exercise. With `.iloc`, the rows are counted from the top down.
- A second approach is to use the **row labels (i.e., the index) using `.loc`**. You have probably seen above the bold numbers at the very left of the dataframe output. These are called the index in the world of Pandas. You can select the row with index label ```i``` by calling ```df.loc[i]```.
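The difference between the two approaches is easiest to see when the index labels are not simply 0, 1, 2, and so on. Here is a minimal sketch with a made-up dataframe whose labels 10, 20, 30 are chosen purely for illustration.

```python
import pandas as pd

# Hypothetical toy data with index labels that differ from the row positions
toy = pd.DataFrame({"name": ["a", "b", "c"]}, index=[10, 20, 30])

first_by_position = toy.iloc[0]   # position 0 -> the row labeled 10
last_by_position = toy.iloc[-1]   # negative positions count from the end
row_by_label = toy.loc[20]        # label 20 -> the second row

print(first_by_position["name"], last_by_position["name"], row_by_label["name"])
print(0 in toy.index)             # label 0 does not exist, so toy.loc[0] would fail
```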

In [23]:
# get the first row -> notice: iloc uses the row numbers but not the value of the index
df.iloc[0]

rating                                                        4.4
review_count                                             136455.0
isbn                                                   0439023483
booktype                                          good_reads:book
author_url      https://www.goodreads.com/author/show/153394.S...
year                                                       2008.0
genre_urls      /genres/young-adult|/genres/science-fiction|/g...
dir                           dir01/2767052-the-hunger-games.html
rating_count                                            2958974.0
name                      The Hunger Games (The Hunger Games, #1)
Name: 0, dtype: object

In [24]:
# the following attribute gives us the axis labels of the dataframe: the row index and the column names
df.axes

[RangeIndex(start=0, stop=6000, step=1),
 Index(['rating', 'review_count', 'isbn', 'booktype', 'author_url', 'year',
        'genre_urls', 'dir', 'rating_count', 'name'],
       dtype='object')]

At the moment, the integer index (`.iloc`) and the index (`.loc`) lead to the same result, but this will change later.

In [26]:
df.loc[0]

rating                                                        4.4
review_count                                             136455.0
isbn                                                   0439023483
booktype                                          good_reads:book
author_url      https://www.goodreads.com/author/show/153394.S...
year                                                       2008.0
genre_urls      /genres/young-adult|/genres/science-fiction|/g...
dir                           dir01/2767052-the-hunger-games.html
rating_count                                            2958974.0
name                      The Hunger Games (The Hunger Games, #1)
Name: 0, dtype: object

In [27]:
# get the last 5 rows
df.iloc[-5:]

Unnamed: 0,rating,review_count,isbn,booktype,author_url,year,genre_urls,dir,rating_count,name
5995,4.17,2226.0,767913736.0,good_reads:book,https://www.goodreads.com/author/show/44565.Ca...,2005.0,/genres/history|/genres/non-fiction|/genres/bi...,dir60/78508.The_River_of_Doubt.html,16618.0,The River of Doubt
5996,3.99,775.0,1416909427.0,good_reads:book,https://www.goodreads.com/author/show/151371.J...,2006.0,/genres/young-adult|/genres/realistic-fiction|...,dir60/259068.Shug.html,6179.0,Shug
5997,3.78,540.0,1620612321.0,good_reads:book,https://www.goodreads.com/author/show/5761314....,2012.0,/genres/contemporary|/genres/romance|/genres/y...,dir60/13503247-flawed.html,2971.0,Flawed
5998,3.91,281.0,,good_reads:book,https://www.goodreads.com/author/show/1201952....,2006.0,/genres/religion|/genres/islam|/genres/religio...,dir60/2750008.html,3083.0,أسعد امرأة في العالم
5999,4.35,61.0,786929081.0,good_reads:book,https://www.goodreads.com/author/show/1023510....,2001.0,/genres/fiction|/genres/fantasy|/genres/magic|...,dir60/66677.Legacy_of_the_Drow_Collector_s_Edi...,3982.0,Legacy of the Drow Collector's Edition (Legacy...


### Exercise: Indexing I
1. Do you remember the negative index? If not, you can discuss it briefly again in your group.

In [29]:
# loc works with the actual values of the index, not with the row number. We will see that this makes a difference later
df.loc[[5995, 5999]]

Unnamed: 0,rating,review_count,isbn,booktype,author_url,year,genre_urls,dir,rating_count,name
5995,4.17,2226.0,767913736,good_reads:book,https://www.goodreads.com/author/show/44565.Ca...,2005.0,/genres/history|/genres/non-fiction|/genres/bi...,dir60/78508.The_River_of_Doubt.html,16618.0,The River of Doubt
5999,4.35,61.0,786929081,good_reads:book,https://www.goodreads.com/author/show/1023510....,2001.0,/genres/fiction|/genres/fantasy|/genres/magic|...,dir60/66677.Legacy_of_the_Drow_Collector_s_Edi...,3982.0,Legacy of the Drow Collector's Edition (Legacy...


## Choose rows by condition

The fun usually starts when you cut rows out of the dataframe and take a closer look at them. Let's see how to do that.

In [31]:
# check whether book was published after 2010
df['year'] > 2010

0       False
1       False
2       False
3       False
4       False
        ...  
5995    False
5996    False
5997     True
5998    False
5999    False
Name: year, Length: 6000, dtype: bool

You see a Boolean value for each row that indicates whether the entry in the column ```year``` is greater than 2010. This output is not too informative by itself. However, you can store the output and then compute statistics, such as the sum or the mean.

### Exercise: Indexing II

1. Store the above output in a variable ```x```.
2. Determine the data type of ```x```.
3. Search online how to get the sum and the mean of ```x```. Do the sum and mean make sense on Boolean values? Discuss in the group what these numbers represent.

In [33]:
#1.
x = df['year'] > 2010

#2.
print(x.dtype)
print(type(x))

#3.
print(x.sum())
print(x.mean())

bool
<class 'pandas.core.series.Series'>
1092
0.182
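Why do these numbers make sense? In arithmetic, `True` counts as 1 and `False` as 0, so the sum of a Boolean series is the number of rows that satisfy the condition and the mean is their share. A small sketch with made-up years:

```python
import pandas as pd

# Hypothetical years for illustration only
years = pd.Series([2008, 2012, 1960, 2011])
mask = years > 2010

count_after = int(mask.sum())     # number of True entries
share_after = float(mask.mean())  # fraction of True entries

print(count_after, share_after)
```

Applied to the output above, 1092 of the 6000 rows have a year after 2010, which is a share of 0.182.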



With the previous exercise, we already have a feeling about what kind of books there are in the dataframe in terms of year published. But now, we want to have all the information about these books. We can do this as follows.

In [35]:
# get all entries where the book was published after 2010
#df_after_2010 = df[df['year'] > 2010]
df_after_2010 = df[x] #would lead to the same result

In [36]:
df_after_2010.head(10)

Unnamed: 0,rating,review_count,isbn,booktype,author_url,year,genre_urls,dir,rating_count,name
26,4.43,112279.0,525478817.0,good_reads:book,https://www.goodreads.com/author/show/1406384....,2012.0,/genres/young-adult|/genres/book-club|/genres/...,dir01/11870085-the-fault-in-our-stars.html,1150626.0,The Fault in Our Stars
43,4.34,82098.0,62024035.0,good_reads:book,https://www.goodreads.com/author/show/4039811....,2011.0,/genres/young-adult|/genres/science-fiction|/g...,dir01/13335037-divergent.html,1127983.0,"Divergent (Divergent, #1)"
163,4.17,43787.0,7442912.0,good_reads:book,https://www.goodreads.com/author/show/4039811....,2012.0,/genres/science-fiction|/genres/dystopia|/genr...,dir02/11735983-insurgent.html,552682.0,"Insurgent (Divergent, #2)"
209,3.69,64489.0,1612130291.0,good_reads:book,https://www.goodreads.com/author/show/4725841....,2011.0,/genres/romance|/genres/adult-fiction|/genres/...,dir03/10818853-fifty-shades-of-grey.html,922131.0,"Fifty Shades of Grey (Fifty Shades, #1)"
256,4.16,13452.0,1442403543.0,good_reads:book,https://www.goodreads.com/author/show/150038.C...,2011.0,/genres/young-adult|/genres/paranormal|/genres...,dir03/6752378-city-of-fallen-angels.html,224643.0,"City of Fallen Angels (The Mortal Instruments,..."
265,4.04,19435.0,61726834.0,good_reads:book,https://www.goodreads.com/author/show/2936493....,2011.0,/genres/young-adult|/genres/science-fiction|/g...,dir03/11614718-delirium.html,229511.0,"Delirium (Delirium, #1)"
282,4.48,11827.0,1416975888.0,good_reads:book,https://www.goodreads.com/author/show/150038.C...,2011.0,/genres/fantasy|/genres/young-adult|/genres/fa...,dir03/10025305-clockwork-prince.html,175876.0,"Clockwork Prince (The Infernal Devices, #2)"
284,4.58,15195.0,1406321346.0,good_reads:book,https://www.goodreads.com/author/show/150038.C...,2013.0,/genres/fantasy|/genres/young-adult|/genres/fa...,dir03/18335634-clockwork-princess.html,130161.0,"Clockwork Princess (The Infernal Devices, #3)"
299,4.2,19915.0,,good_reads:book,https://www.goodreads.com/author/show/4464118....,2011.0,/genres/romance|/genres/new-adult|/genres/youn...,dir03/11505797-beautiful-disaster.html,271613.0,"Beautiful Disaster (Beautiful, #1)"
329,3.93,23622.0,1612130585.0,good_reads:book,https://www.goodreads.com/author/show/4725841....,2011.0,/genres/romance|/genres/adult-fiction|/genres/...,dir04/11857408-fifty-shades-darker.html,431846.0,"Fifty Shades Darker (Fifty Shades, #2)"


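Note that the bold index at the very left no longer simply counts the rows. The reason is that filtering keeps the original row labels: ```df_after_2010``` contains a subset of the rows of ```df```, and each row retains the label it had in ```df```. Whether such a selection is a view on the original data or a copy is an implementation detail of Pandas; Boolean filtering like this typically returns a copy.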
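Here is a small sketch of this behavior on a made-up dataframe. The `reset_index(drop=True)` call at the end is not used in this exercise; it is shown only as one possible way to renumber the rows if the old labels are no longer needed.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({"year": [2008, 2012, 1960, 2011]})

subset = toy[toy["year"] > 2010]
print(subset.index.tolist())        # the labels from toy are kept

# Optional: renumber the rows from 0 and discard the old labels
renumbered = subset.reset_index(drop=True)
print(renumbered.index.tolist())
```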

### Exercise: Indexing III

1. Recall the difference between ```df.loc``` and ```df.iloc``` and select the 5th row and then the row with index 299 of the dataframe ```df_after_2010```.

In [39]:
#1.

# 5th row -> position 4
print(df_after_2010.iloc[4]) 
# row with index label 299
#print(df_after_2010.loc[299])
# careful: iloc[299] is the 300th row, not the row labeled 299
#print(df_after_2010.iloc[299])
# label 0 does not exist in the filtered dataframe
#print(df_after_2010.loc[0]) #will not be found (KeyError)

print(df_after_2010.axes)
#print(df_after_2010.shape)


rating                                                       4.16
review_count                                              13452.0
isbn                                                   1442403543
booktype                                          good_reads:book
author_url      https://www.goodreads.com/author/show/150038.C...
year                                                       2011.0
genre_urls      /genres/young-adult|/genres/paranormal|/genres...
dir                      dir03/6752378-city-of-fallen-angels.html
rating_count                                             224643.0
name            City of Fallen Angels (The Mortal Instruments,...
Name: 256, dtype: object
[Index([  26,   43,  163,  209,  256,  265,  282,  284,  299,  329,
       ...
       5946, 5947, 5972, 5973, 5974, 5975, 5976, 5989, 5993, 5997],
      dtype='int64', length=1092), Index(['rating', 'review_count', 'isbn', 'booktype', 'author_url', 'year',
       'genre_urls', 'dir', 'rating_count', 'name'],
     

In [40]:
# Let's see how many books were published after 2010
df_after_2010.shape

(1092, 10)

In [41]:
# Let's get the books that were published after 2010 and which have a rating of 5.0
df_after_2010_rating_equal_5 = df[(df['year'] > 2010) & (df['rating'] == 5.0)]

Now it's becoming more interesting: you can combine selection criteria. Each criterion must be wrapped in parentheses, because `&` and `|` bind more tightly than comparisons such as `>` or `==`. You can combine them as follows:

- ```x & y``` means ```x AND y```. So you state that you want both criteria to be True.
- ```x | y``` means ```x OR y```. So you state that you want at least one of the criteria to be True (or both).
- ```~x``` means ```NOT x```. So you state that you want the criterion ```x``` to be False.
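The three operators can be sketched on a made-up dataframe as follows (the four rows are assumptions for illustration). Each comparison is wrapped in its own parentheses before it is combined with another one.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "year": [1813, 2008, 2012, 2014],
    "rating": [4.2, 4.4, 5.0, 3.9],
})

# AND: published after 2010 and rated exactly 5.0
both = toy[(toy["year"] > 2010) & (toy["rating"] == 5.0)]

# OR: published before 1850 or rated exactly 5.0
either = toy[(toy["year"] < 1850) | (toy["rating"] == 5.0)]

# NOT: everything that was not published after 2010
not_recent = toy[~(toy["year"] > 2010)]

print(both.index.tolist(), either.index.tolist(), not_recent.index.tolist())
```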

In [43]:
df_after_2010_rating_equal_5

Unnamed: 0,rating,review_count,isbn,booktype,author_url,year,genre_urls,dir,rating_count,name
1718,5.0,28.0,,good_reads:book,https://www.goodreads.com/author/show/6467808....,2014.0,/genres/poetry|/genres/childrens,dir18/22204746-an-elephant-is-on-my-house.html,64.0,An Elephant Is On My House
2145,5.0,3.0,1300589469.0,good_reads:book,https://www.goodreads.com/author/show/6906561....,2012.0,,dir22/17287259-a-book-about-absolutely-nothing...,63.0,A Book About Absolutely Nothing.
2903,5.0,0.0,983002282.0,good_reads:book,https://www.goodreads.com/author/show/6589034....,2012.0,,dir30/17608096-obscured-darkness.html,8.0,Obscured Darkness (Family Secrets #2)
2909,5.0,0.0,983002215.0,good_reads:book,https://www.goodreads.com/author/show/6589034....,2011.0,,dir30/16200303-family-secrets.html,9.0,Family Secrets
4473,5.0,0.0,,good_reads:book,https://www.goodreads.com/author/show/6896621....,2012.0,,dir45/17259227-patience-s-love.html,7.0,Patience's Love
5564,5.0,9.0,,good_reads:book,https://www.goodreads.com/author/show/7738947....,2014.0,/genres/romance|/genres/new-adult,dir56/21902777-untainted.html,14.0,"Untainted (Photographer Trilogy, #3)"
5692,5.0,0.0,,good_reads:book,https://www.goodreads.com/author/show/5989528....,2012.0,,dir57/14288412-abstraction-in-theory---laws-of...,6.0,Abstraction In Theory - Laws Of Physical Trans...


### Exercise: Selecting data and descriptive statistics

1. Get the books that were published before 1850 or after 2008.
2. Which book was published after 2005 and has a rating lower than 2.5?
3. Calculate the mean and standard deviation of the year.
4. Which formula is implemented in the methods `std()` and `var()` of the pandas dataframe?
5. Calculate the median and the first and the third quartile.

In [45]:
#1.
df_before_1850_or_after_2008 = df[(df.year < 1850) | (df.year > 2008)]
print(df_before_1850_or_after_2008)

#2.
df_after_2005_rating_below_25 = df[(df.year > 2005) & (df.rating < 2.5)]
print(df_after_2005_rating_below_25)


      rating  review_count        isbn         booktype  \
4       4.23       34772.0  0679783261  good_reads:book   
14      3.72       10156.0  0743477111  good_reads:book   
26      4.43      112279.0  0525478817  good_reads:book   
32      4.44       70247.0  0399155341  good_reads:book   
34      4.07       22713.0  0142437204  good_reads:book   
...      ...           ...         ...              ...   
5986    4.06         954.0  1606840584  good_reads:book   
5989    3.36         192.0  842534607X  good_reads:book   
5991    4.20         650.0         NaN  good_reads:book   
5993    4.09        1256.0  0345515501  good_reads:book   
5997    3.78         540.0  1620612321  good_reads:book   

                                             author_url    year  \
4     https://www.goodreads.com/author/show/1265.Jan...  1813.0   
14    https://www.goodreads.com/author/show/947.Will...  1597.0   
26    https://www.goodreads.com/author/show/1406384....  2012.0   
32    https://www.goodr

In [46]:
#3. Mean and standard deviation 
print("Mean: \t\t", df.year.mean())
print("SD: \t\t", df.year.std())

#5. quartiles
print("Median: \t", df.year.median())
print("First quartile: ", df.year.quantile(q=0.25))
print("Third quartile: ", df.year.quantile(q=0.75))

# an easier way to describe the data
#df.year.describe()
df.describe()

Mean: 		 1969.0850992824962
SD: 		 185.38316893845032
Median: 	 2002.0
First quartile:  1980.0
Third quartile:  2009.0


| |rating|review_count|year|rating_count|
|---|---|---|---|---|
|count|5998.0|5998.0|5993.0|5998.0|
|mean|4.042201|2372.487162|1969.085099|51142.78|
|std|0.260661|5491.177007|185.383169|137599.3|
|min|2.0|0.0|-1500.0|5.0|
|25%|3.87|389.25|1980.0|7495.25|
|50%|4.05|932.5|2002.0|18063.0|
|75%|4.21|2210.5|2009.0|42853.25|
|max|5.0|136455.0|2014.0|2958974.0|
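Regarding question 4: by default, `std()` and `var()` in Pandas use the sample formulas, i.e., the sum of squared deviations from the mean is divided by n - 1 (parameter `ddof=1`). NumPy's `np.std`, in contrast, divides by n by default (`ddof=0`). The following sketch compares both on a small made-up series.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration only; their mean is 5
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_var = values.var()            # divides by n - 1 (ddof=1, the pandas default)
sample_sd = values.std()             # square root of the sample variance
population_sd = values.std(ddof=0)   # divides by n
numpy_sd = np.std(values.to_numpy()) # NumPy default is ddof=0

print(sample_var, sample_sd, population_sd, numpy_sd)
```

For large datasets such as the 5993 years above, the difference between the two formulas is negligible, but it is worth knowing when comparing results across libraries.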


## Cleaning: Examining the dataframe - a deeper look

Beyond checking some quick general properties of the dataframe and looking at the first rows, we can dig a bit deeper into the values being stored. One thing that occurs frequently in real-world data is missing values. I have seen many forms of missing values, such as

- -1 or 999 for the age of patients in hospital data
- 0 for the height of patients
- NaN (not a number)

All of them need to be detected and taken care of because these missing values can distort the whole data analytics pipeline. Among them, NaN is the nicest one because it is easy to spot.

Let's start with NaN and check a column that seemed OK to us.

In [49]:
#Get a sense of how many missing values there are in the dataframe.
df['isbn'].isnull()

0       False
1       False
2       False
3       False
4       False
        ...  
5995    False
5996    False
5997    False
5998     True
5999    False
Name: isbn, Length: 6000, dtype: bool
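The other forms of missing values mentioned above (such as -1, 999, or 0) are not found by `isnull()`. One possible approach, sketched here on a made-up patient table (column names and values are assumptions), is to replace such placeholder values with NaN first so that all missing values can be handled in the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical patient data with placeholder values for "missing"
patients = pd.DataFrame({
    "age": [34, -1, 999, 58],
    "height_cm": [172, 0, 165, 180],
})

# Turn the placeholders into NaN
patients["age"] = patients["age"].replace([-1, 999], np.nan)
patients["height_cm"] = patients["height_cm"].replace(0, np.nan)

missing_per_column = patients.isnull().sum()
print(missing_per_column)
```

Which values count as placeholders depends on the dataset, so this step always requires a look at the data and its documentation.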

### Exercise: Missing values
1. Recall the function ```sum``` that you looked up above? Figure out how many values are missing in the column ```isbn```
2. Combine the selection criteria with the function ```isnull()``` to print rows for which there is an ISBN
3. See how many missing values every column has (you can also use a for-loop on ```df.columns``` here)

In [51]:
#1. Number of missing values
#print(df['isbn'].isnull().sum())
print(df.isbn.isnull().sum())

#2. rows with a missing ISBN
#print(df[df['isbn'].isnull()])
#rows for which there is an ISBN (complete cases)
#print(df[~df['isbn'].isnull()])

#3. missing values in each column
for c in df.columns:
  print(c)
  print(df[c].isnull().sum())
#alternative
print(df.isnull().sum())

477
rating
2
review_count
2
isbn
477
booktype
2
author_url
2
year
7
genre_urls
62
dir
0
rating_count
2
name
2
rating            2
review_count      2
isbn            477
booktype          2
author_url        2
year              7
genre_urls       62
dir               0
rating_count      2
name              2
dtype: int64


## Cleaning: Dealing with Missing Values
How should we interpret 'missing' or 'invalid' values in the data (hint: look at where these values occur)? One approach is to simply exclude them from the dataframe. Is this appropriate for all 'missing' or 'invalid' values? 

In [53]:
#Treat the missing or invalid values in the column 'year' of your dataframe
df_clean = df[df['year'].notnull()].copy()

print(df.shape)
print(df_clean.shape)

(6000, 10)
(5993, 10)


OK, so we have removed all the NaNs in the column ```year```. You can see that we only removed 7 rows, which is not too bad. Always check how many rows you are removing, as data is valuable. If you are removing 50% of the data, better think of another strategy than simply removing the rows.

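As you have probably noticed above, the data type of this column was ```float```. This is because NaN is itself a floating point value, so a column of whole numbers with missing entries is stored as ```float```. With the NaNs removed, we can now change the type to ```int```, which makes more sense for a year.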

In [55]:
#print(df.dtypes)
df_clean['year'] = df_clean['year'].astype(int)
print(df_clean.dtypes)

rating          float64
review_count    float64
isbn             object
booktype         object
author_url       object
year              int64
genre_urls       object
dir              object
rating_count    float64
name             object
dtype: object
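The order of the two steps matters: converting to `int` while NaNs are still present fails. The sketch below shows this on a made-up series. It also shows the nullable `"Int64"` dtype, which is not used in this exercise but is one alternative if you want integers and still need to keep the missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical years with one missing value
years = pd.Series([2008.0, np.nan, 1960.0])

# Converting directly fails because NaN cannot be represented as int
try:
    years.astype(int)
    conversion_failed = False
except ValueError:
    conversion_failed = True
print(conversion_failed)

# Removing the missing values first makes the conversion work
clean_years = years[years.notnull()].astype(int)
print(clean_years.tolist())

# Alternative (not used in this exercise): nullable integers keep the gap
nullable_years = years.astype("Int64")
print(nullable_years.isnull().sum())
```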


### Exercise: Dealing with missing values

1. Difficult: Let's try to change the data types of ```review_count```, ```isbn```, and ```rating_count``` to ```int```. Discuss whether this data type is appropriate for each of these columns. If the type conversion fails, we know that we have further problems and have to remove rows with missing values.

Functions that can be helpful are:
- ```df_clean['isbn'].str.isdigit()```, which returns a Boolean vector that is True where the entry of column ```isbn``` consists only of digits and False otherwise.
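As a small sketch of this helper, the following uses three ISBN strings that appear in the outputs above, one of which ends in the check character `X`. It also shows a side effect worth discussing: converting an ISBN to a number drops its leading zero.

```python
import pandas as pd

# Three ISBN strings as they appear in the outputs above
isbns = pd.Series(["0439023483", "842534607X", "1416909427"])

only_digits = isbns.str.isdigit()   # False for the entry containing "X"
digit_isbns = isbns[only_digits]
print(only_digits.tolist())

# Numeric conversion works now, but the leading zero is lost
as_numbers = pd.to_numeric(digit_isbns)
print(as_numbers.tolist())
```

Whether an identifier such as the ISBN should be stored as a number at all is therefore a fair point for the discussion in this exercise.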

In [57]:
#Let's see which columns still have missing values
df_clean.isnull().sum()

rating            0
review_count      0
isbn            471
booktype          0
author_url        0
year              0
genre_urls       59
dir               0
rating_count      0
name              0
dtype: int64

In [58]:
df_clean = df_clean[df_clean['isbn'].notnull()]
#df_clean = df_clean[df_clean['genre_urls'].notnull()] # we will replace these values later

#Missing values in column isbn are now fixed
print(df_clean.isnull().sum())

rating           0
review_count     0
isbn             0
booktype         0
author_url       0
year             0
genre_urls      26
dir              0
rating_count     0
name             0
dtype: int64


Before we continue, we explicitly copy the Pandas dataframe. When we select rows as above, Pandas does not guarantee whether the result is a view on the original data or an independent copy, and assigning to columns of such a selection can trigger a `SettingWithCopyWarning` in many Pandas versions. An explicit copy avoids this ambiguity:

In [101]:
df_clean = df_clean.copy()

Now we continue with updating the datatypes.

In [103]:
df_clean['review_count'] = df_clean['review_count'].astype(int)
df_clean['rating_count'] = df_clean['rating_count'].astype(int)

In [51]:
#isbn still has some entries which are not integers. We fix that now
df_clean = df_clean[df_clean['isbn'].str.isdigit()].copy()
df_clean['isbn'] = pd.to_numeric(df_clean['isbn']) #we use the pandas function

Once you do this, we seem to be good on these columns (no errors in conversion). Let's look:

In [105]:
df_clean.dtypes

rating          float64
review_count      int64
isbn             object
booktype         object
author_url       object
year              int64
genre_urls       object
dir              object
rating_count      int64
name             object
dtype: object

Sweet! Let's overwrite the df with the cleaned version.

In [107]:
df = df_clean

Some of the other columns that should be strings contain NaN. We now want to set them to "", an empty string. You might think about something like

```
df[df['genre_urls'].isnull()]['genre_urls'] = ""
```

Please try it out and see what happens.


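The chained expression first evaluates ```df[condition]```, which produces a temporary object (for a Boolean condition, a copy of the selected rows). The assignment then changes this temporary object rather than ```df``` itself, so ```df``` stays unchanged and Pandas typically emits a warning. Instead, we can use the ```.loc``` indexer, which takes the row condition and the column name in a single step: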

In [93]:
df.loc[df['genre_urls'].isnull(), 'genre_urls']=""
#df.loc[df['isbn'].isnull(), 'isbn'] = float("nan")

In [95]:
print(df['genre_urls'].isnull().sum())
#print(df['isbn'].isnull().sum())

0
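The same pattern can be sketched on a made-up dataframe (the values are assumptions for illustration). The second half shows `fillna`, which is not used in this exercise but is a common shortcut for the special case of replacing missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing genre string
toy = pd.DataFrame({"genre_urls": ["/genres/poetry", np.nan, "/genres/romance"]})

# Row condition and column name in a single .loc call
toy.loc[toy["genre_urls"].isnull(), "genre_urls"] = ""
print(toy["genre_urls"].tolist())

# Alternative for missing values specifically: fillna on the column
toy2 = pd.DataFrame({"genre_urls": ["/genres/poetry", np.nan]})
toy2["genre_urls"] = toy2["genre_urls"].fillna("")
print(toy2["genre_urls"].isnull().sum())
```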


Nice, you have now learned the basic functionality of Pandas! Pandas is a very powerful package which I use daily. Learning how to work with data in Pandas dataframes is an extremely important and valuable skill. If you like, you can continue learning Pandas with an online course, e.g., https://www.datacamp.com/tutorial/pandas or https://www.youtube.com/watch?v=r-uOLxNrNk8.