# Tutorials or just some Readings

## Syntax for writing notes in Jupyter

Jupyter Markdown cells support a handful of formatting rules for writing notes. I will explain all the ones I know here.<br>
**Headings**

`#` this is the title<br>
`##` major headings<br>
`###` subheadings<br>
`####` 4th level subheadings

`<br>` forces a line break; it is used at the end of each of the lines above.

If you want to bold something, wrap it in two stars on each side: **this is bold using two stars.**

*This is italicized using one star each side*

Use the `>` symbol to indent text as a block quote, a bit like a tab. See the example:

> This is the tabbed text

You can also write code inside the Markdown notes by opening a block with three backticks followed by the language name, and closing it with three backticks.

```python
def func():
    print("Hello World")
```

**Next is the Colored Text**<br>
<font color = blue>This is colored text using the code `<font color = insert color here></font>`</font>

**How to write equations**

To write an equation, use this code: `$$\alpha + \frac{\beta}{\gamma} = \delta$$` $$\alpha + \frac{\beta}{\gamma} = \delta$$

The equation above is displayed on its own line rather than inline, which is what the two dollar signs (`$$`) on each side do. If you want the equation to sit inline with the sentence or the paragraph, use only one dollar sign (`$`) on each side.

This is an example of an inline equation: $\alpha + \frac{\beta}{\gamma} = \delta$

**List**

This is how you write a list.

Start each item with a dash `-`, or with a number followed by a dot like this: `1.`

1. first
2. second
3. third


- one
- two
- three
 - this is for a square bullet


**Images** are written the same way as in HTML:
use the `<img>` tag with its `src` attribute.

`<img src="url.gif" alt="Alt text that describes the graphic" title="Title text"/>`<br>
Example:
<img src="dont-quit-wallpaper.jpg" alt="Alt text that describes the graphic" title="Title text" width = '150' />

**Internal links**: To link to a section, use this code: `[section title](#section-title)` For the text in the parentheses, replace spaces and special characters with a hyphen.<br>

Example:

[Syntax for writing notes in Jupyter](#Syntax-for-writing-notes-in-Jupyter)

External links: Use this code and test all links! `__[link text](http://url)__`

### Now we go straight to coding

`%matplotlib inline` is a magic command exclusive to Jupyter / IPython (it is not regular Python) which tells the notebook to draw the plots inside the notebook itself.

```python
%matplotlib inline
# the line above prepares the IPython notebook for working with matplotlib

# Each "as ..." clause gives the module a shorter alias
# that is easier to type when calling it


import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd

pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
```

In the code above, what does `import numpy as np` mean?
`numpy` (Numerical Python) is a library for fast computations on arrays. Plain Python lists do not support element-wise arithmetic directly, so NumPy lets you do those computations on arrays instead.
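
To make the difference between lists and arrays concrete, here is a small sketch. The `prices` values are made up for illustration only.

```python
import numpy as np

prices = [1.5, 2.0, 3.25]   # made-up numbers

# Multiplying a list by an integer repeats the list; it does not scale the values.
repeated = prices * 2

# With plain lists, element-wise math needs an explicit loop or comprehension.
doubled_list = [p * 2 for p in prices]

# A NumPy array applies the operation to every element at once.
doubled_array = np.array(prices) * 2

print(repeated)
print(doubled_list)
print(doubled_array)
```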

So what are Libraries?

Libraries are code written by developers so that people do not have to write the same code over and over again. The code is called from Python (but is not necessarily written in Python), packaged as a module that hides how it is implemented and simply gives you the result you need in the most intuitive way possible. Each library contains code made for specific tasks; in this case NumPy handles numerical operations on arrays, from basic multiplication upward. This way you don't have to write long lines of code: you use dot notation (`name.function`) and get the result you need.

**What is this `as` thing in the code above?**

The `as` keyword binds the library to a shorter, easier-to-remember name so that you can call the module with less typing. This is called aliasing. It keeps your work as intuitive as possible, since some modules have names that are only historically meaningful.
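
A quick way to convince yourself that an alias is just another name for the same module is the sketch below. It uses the standard `math` module as a stand-in, only because it ships with Python; the same idea applies to `numpy as np`.

```python
import math          # the full module name
import math as m     # the same module, bound to the shorter name `m`

# Both names point at the same module object, so either one works.
same_module = m is math
root = m.sqrt(16)

print(same_module, root)
```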

In [2]:
%matplotlib inline

import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd

# this is called aliasing
# aliasing is used to call modules or libraries into a shorter form variable

pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True) # will display as html

import seaborn as sns
# a high level interface for plotting
# a beautifier for your plots and graphs

Numpy
> Numerical Python - fast list/array object computations

Scipy
> Statistical Functions / Distributions / Optimizations

Matplotlib
> Plotting Library

Color Maps
> Module inside Matplotlib 

Pyplot
> Do the Plotting

Pandas
> Concepts from dataframes in R (Programming Language)

In [3]:
df = pd.read_csv('all.csv', header = None, 
                 names = ['rating', 'review_count', 'isbn', 'booktype', 'author_url', 'year', 'genre_url',  'dir', 'rating_count', 'name'])

df.head()

Unnamed: 0,rating,review_count,isbn,booktype,author_url,year,genre_url,dir,rating_count,name
0,4.4,136455,439023483,good_reads:book,https://www.goodreads.com/author/show/153394.S...,2008.0,/genres/young-adult|/genres/science-fiction|/g...,dir01/2767052-the-hunger-games.html,2958974,"The Hunger Games (The Hunger Games, #1)"
1,4.41,16648,439358078,good_reads:book,https://www.goodreads.com/author/show/1077326....,2003.0,/genres/fantasy|/genres/young-adult|/genres/fi...,dir01/2.Harry_Potter_and_the_Order_of_the_Phoe...,1284478,Harry Potter and the Order of the Phoenix (Har...
2,3.56,85746,316015849,good_reads:book,https://www.goodreads.com/author/show/941441.S...,2005.0,/genres/young-adult|/genres/fantasy|/genres/ro...,dir01/41865.Twilight.html,2579564,"Twilight (Twilight, #1)"
3,4.23,47906,61120081,good_reads:book,https://www.goodreads.com/author/show/1825.Har...,1960.0,/genres/classics|/genres/fiction|/genres/histo...,dir01/2657.To_Kill_a_Mockingbird.html,2078123,To Kill a Mockingbird
4,4.23,34772,679783261,good_reads:book,https://www.goodreads.com/author/show/1265.Jan...,1813.0,/genres/classics|/genres/fiction|/genres/roman...,dir01/1885.Pride_and_Prejudice.html,1388992,Pride and Prejudice


### So how does this code work?

```python
df = pd.read_csv('all.csv', header = None, 
                 names = ['rating', 'review_count', 'isbn', 'booktype', 'author_url', 'year', 'genre_url', 'dir', 'rating_count', 'name'])

df.head()
```

Here the code `df = pd.read_csv(...)`
> the result of the call (the table that was read) is saved into a variable named `df`

The `pd.read_csv`
> `pd` is the pandas library, and inside this library is a function called `read_csv` which reads csv files

The contents inside the parentheses are mostly self-explanatory
> `'all.csv'` is the filename within your folder or directory, `header = None` says the file has no header row, and `names` supplies the column names to use instead

`df.head()` displays the first 5 rows of your dataset.
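
If you want to try these arguments without the real file, here is a minimal sketch. It reads two made-up rows from an in-memory string instead of `all.csv`; the `sample` name and the three columns are my own choices for illustration.

```python
import io
import pandas as pd

# Two made-up rows standing in for a CSV file without a header line.
raw = io.StringIO("4.40,136455,2008\n3.56,85746,2005\n")

# header=None: the first line is data, not column names.
# names=[...]: the column labels to use instead.
sample = pd.read_csv(raw, header=None, names=["rating", "review_count", "year"])

print(sample.head())          # the first rows (at most 5)
print(sample.shape)           # (rows, columns)
print(list(sample.columns))
```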


The dataframe concept comes from the R programming language. A dataframe can be thought of as a set of columns, each of which is a pandas object (a `Series`). This is a little confusing at first.

Pandas treats a dataframe as columns pasted together. <img src = "pandastruct.png">
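
The "columns pasted together" idea can be sketched like this. The values and the `books` name are made up; the point is only that each column is a `Series` and that a column pulled back out is a `Series` again.

```python
import pandas as pd

# Made-up values; each Series is one column.
rating = pd.Series([4.40, 3.56, 4.23])
year = pd.Series([2008, 2005, 1960])

# Paste the columns side by side into a DataFrame.
books = pd.DataFrame({"rating": rating, "year": year})

# Pulling a column back out gives a Series again.
column = books["rating"]

print(type(column).__name__)
print(books.dtypes)
```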

In [4]:
df.dtypes

rating          float64
review_count     object
isbn             object
booktype         object
author_url       object
year            float64
genre_url        object
dir              object
rating_count     object
name             object
dtype: object

Why is checking the data types important?

In pandas, if a column of numbers contains even a single value written with letters or other symbols, pandas reads and treats **ALL** of the values in that column as strings, which it reports as `object`.

Checking the data types is therefore a quick way to spot errors or incorrectly encoded data.

rating         ---  object `<<<<<<< why is this called an object?` *Strings are referred to as objects | rating should be float*<br>
review_count     --- object <br>
isbn           --- float64 `<<<these are floating point objects (decimals)` <br>
booktype        --- object <br>
author_url      --- object <br>
year            --- object <br>
genre_url       --- object <br>
dtype: object
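
To see how a single stray value changes a column's type, here is a small sketch with made-up numbers. Using `pd.to_numeric` with `errors="coerce"` to locate the bad entries is my own suggestion, not something the notebook above does.

```python
import pandas as pd

clean = pd.Series([10, 20, 30])
dirty = pd.Series([10, 20, "None"])   # one stray string among the numbers

print(clean.dtype)   # an integer dtype
print(dirty.dtype)   # object

# errors="coerce" turns entries that cannot be parsed into NaN,
# which makes the offending rows easy to locate.
as_numbers = pd.to_numeric(dirty, errors="coerce")
bad_rows = dirty[as_numbers.isnull()]

print(bad_rows)
```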

>`df.shape` shows (no. of rows, no. of columns) when the data is 2-dimensional.

In [5]:
df.shape

(6000, 10)

In [6]:
df.columns

Index(['rating', 'review_count', 'isbn', 'booktype', 'author_url', 'year', 'genre_url', 'dir', 'rating_count', 'name'], dtype='object')

`df.columns` shows you the names of the columns.

#### What if you want to see the data in just one of the columns?

Use the syntax `df.column_name`, or `df['column_name']` when the name contains spaces or special characters.

In [7]:
df.rating

0       4.40
1       4.41
2       3.56
3       4.23
4       4.23
5       4.25
6       4.22
7       4.38
8       3.79
9       4.18
10      4.03
11      3.72
12      4.36
13      4.05
14      3.72
15      4.09
16      3.92
17      4.58
18      3.60
19      4.28
20      4.02
21      4.14
22      4.11
23      4.20
24      3.75
25      3.94
26      4.43
27      3.79
28      4.04
29      3.94
        ... 
5970    3.97
5971    4.24
5972    4.19
5973    4.17
5974    3.99
5975    4.07
5976    4.23
5977    4.03
5978    3.99
5979    2.77
5980    3.84
5981    3.36
5982    4.09
5983    4.23
5984    4.02
5985    3.61
5986    4.06
5987    4.26
5988    4.34
5989    3.36
5990    4.12
5991    4.20
5992    3.89
5993    4.09
5994    4.37
5995    4.17
5996    3.99
5997    3.78
5998    3.91
5999    4.35
Name: rating, Length: 6000, dtype: float64
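
Selecting a column can be written in two ways. The sketch below uses a made-up two-row table; the `review count` column (with a space) is hypothetical and is there only to show when brackets are needed.

```python
import pandas as pd

books = pd.DataFrame({"rating": [4.40, 3.56], "review count": [136455, 85746]})

by_attribute = books.rating          # works when the name is a valid Python identifier
by_bracket = books["rating"]         # works for any column name
with_space = books["review count"]   # a name with a space needs brackets

print(by_attribute.equals(by_bracket))
print(with_space.tolist())
```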

`df.rating < 3` This syntax creates a boolean mask. It tests each element of the series against the condition, one by one, and returns `True` or `False` for each depending on whether the condition is met.

In [8]:
show = (df.rating < 3)
show.head(5)

0    False
1    False
2    False
3    False
4    False
Name: rating, dtype: bool

This gives us `True`s and `False`s. Such a series is called a mask. If we count the number of `True`s, and divide by the total, we'll get the fraction of ratings $\lt$ 3. To do this numerically see this:

In [9]:
np.sum(df.rating < 3)

4

`np.sum` adds up values. When applied to `df.rating < 3`, each `True` (a rating lower than 3) counts as `1` and each `False` counts as `0`, so the sum is the number of ratings lower than 3, which here is 4.
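
Here is the same idea on a tiny made-up series, so that the mask and the count can be checked by eye. The `ratings` values are not from the dataset.

```python
import numpy as np
import pandas as pd

ratings = pd.Series([4.40, 2.90, 3.56, 2.00, 4.23])   # made-up ratings

mask = ratings < 3          # one True/False per element
low_count = np.sum(mask)    # True counts as 1, False as 0

print(mask.tolist())
print(low_count)
```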

The code `df[df.rating < 3]` takes the entire dataframe and returns only the rows where the condition inside the brackets is true.

In [10]:
df[df.rating < 3]

Unnamed: 0,rating,review_count,isbn,booktype,author_url,year,genre_url,dir,rating_count,name
2609,2.9,8,,good_reads:book,https://www.goodreads.com/author/show/7707820....,2013.0,/genres/romance|/genres/realistic-fiction|/gen...,dir27/19546932-how-to-be-a-perfect-girl.html,31,How To Be A Perfect Girl
3738,2.0,368,983650322.0,good_reads:book,https://www.goodreads.com/author/show/9414.Vic...,2011.0,/genres/young-adult|/genres/science-fiction|/g...,dir38/12393909-revealing-eden.html,688,"Revealing Eden (Save the Pearls, #1)"
5844,2.97,1399,395083621.0,good_reads:book,https://www.goodreads.com/author/show/30691.Ad...,1925.0,/genres/history|/genres/non-fiction|/genres/bi...,dir59/54270.Mein_Kampf.html,12417,Mein Kampf
5979,2.77,800,60988649.0,good_reads:book,https://www.goodreads.com/author/show/7025.Gre...,2001.0,/genres/fantasy|/genres/fiction|/genres/myster...,dir60/24929.Lost.html,11128,Lost


You can also select just one column of the dataframe and filter it with the same condition, using the code below.

In [11]:
df.review_count[df.rating < 3]

2609       8
3738     368
5844    1399
5979     800
Name: review_count, dtype: object

In [12]:
#Another way of doing this is 
df_sub1 = df[df.rating < 3]
df_sub1.head()

Unnamed: 0,rating,review_count,isbn,booktype,author_url,year,genre_url,dir,rating_count,name
2609,2.9,8,,good_reads:book,https://www.goodreads.com/author/show/7707820....,2013.0,/genres/romance|/genres/realistic-fiction|/gen...,dir27/19546932-how-to-be-a-perfect-girl.html,31,How To Be A Perfect Girl
3738,2.0,368,983650322.0,good_reads:book,https://www.goodreads.com/author/show/9414.Vic...,2011.0,/genres/young-adult|/genres/science-fiction|/g...,dir38/12393909-revealing-eden.html,688,"Revealing Eden (Save the Pearls, #1)"
5844,2.97,1399,395083621.0,good_reads:book,https://www.goodreads.com/author/show/30691.Ad...,1925.0,/genres/history|/genres/non-fiction|/genres/bi...,dir59/54270.Mein_Kampf.html,12417,Mein Kampf
5979,2.77,800,60988649.0,good_reads:book,https://www.goodreads.com/author/show/7025.Gre...,2001.0,/genres/fantasy|/genres/fiction|/genres/myster...,dir60/24929.Lost.html,11128,Lost


In [13]:
df_sub1.review_count

2609       8
3738     368
5844    1399
5979     800
Name: review_count, dtype: object

The code above is useful if you just want to look at certain rows, or if you want to change the values of certain elements in your dataframe.
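
The filtering steps above can be sketched on a small made-up table. The `.loc` form at the end is an alternative I am adding for comparison; it selects the rows and the column in one step.

```python
import pandas as pd

books = pd.DataFrame({
    "name": ["A", "B", "C", "D"],          # made-up titles
    "rating": [4.40, 2.90, 3.56, 2.00],
    "review_count": [120, 8, 45, 368],
})

low = books[books.rating < 3]                         # whole rows
low_reviews = books.review_count[books.rating < 3]    # one column, filtered
low_reviews_loc = books.loc[books.rating < 3, "review_count"]  # same result via .loc

print(low)
print(low_reviews.tolist())
```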

In [14]:
np.sum(df.rating < 3)/df.shape[0]

0.0006666666666666666

In [15]:
1/4, 1.0/4, 1//4

(0.25, 0.25, 0)

In [16]:
np.mean(df.rating < 3)

0.0006666666666666666

In [17]:
(df.rating < 3).mean()

0.0006666666666666666

In [18]:
df.rating.mean(), df.rating.std()

(4.042200733577858, 0.2606608212818681)

#### What can you infer from the calculations above?

In terms of probability, the probability of picking a book with a rating less than 3 from this population of the best 6000 books on Goodreads is **0.000667**
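
The three ways of computing the fraction used above give the same number. Here is a sketch on a made-up series, followed by the arithmetic for the 4 low-rated books out of 6000 reported earlier.

```python
import numpy as np
import pandas as pd

ratings = pd.Series([4.40, 2.90, 3.56, 2.00, 4.23])   # made-up ratings
mask = ratings < 3

by_division = np.sum(mask) / len(ratings)   # count of True divided by the total
by_np_mean = np.mean(mask)                  # the mean of True/False is the same fraction
by_method = mask.mean()                     # same again, as a Series method

# The numbers from the notebook: 4 books out of 6000.
document_fraction = 4 / 6000

print(by_division, by_np_mean, by_method)
print(round(document_fraction, 6))
```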

In [19]:
df.query("rating > 4.5")

Unnamed: 0,rating,review_count,isbn,booktype,author_url,year,genre_url,dir,rating_count,name
17,4.58,1314,0345538374,good_reads:book,https://www.goodreads.com/author/show/656983.J...,1973.0,/genres/fantasy|/genres/classics|/genres/scien...,dir01/30.J_R_R_Tolkien_4_Book_Boxed_Set.html,68495,J.R.R. Tolkien 4-Book Boxed Set
162,4.55,15777,075640407X,good_reads:book,https://www.goodreads.com/author/show/108424.P...,2007.0,/genres/fantasy|/genres/fiction,dir02/186074.The_Name_of_the_Wind.html,210018,The Name of the Wind (The Kingkiller Chronicle...
222,4.53,15256,055357342X,good_reads:book,https://www.goodreads.com/author/show/346732.G...,2000.0,/genres/fantasy|/genres/fiction|/genres/fantas...,dir03/62291.A_Storm_of_Swords.html,327992,"A Storm of Swords (A Song of Ice and Fire, #3)"
242,4.53,5404,0545265355,good_reads:book,https://www.goodreads.com/author/show/153394.S...,2010.0,/genres/young-adult|/genres/fiction|/genres/fa...,dir03/7938275-the-hunger-games-trilogy-boxset....,102330,The Hunger Games Trilogy Boxset (The Hunger Ga...
249,4.80,644,0740748475,good_reads:book,https://www.goodreads.com/author/show/13778.Bi...,2005.0,/genres/sequential-art|/genres/comics|/genres/...,dir03/24812.The_Complete_Calvin_and_Hobbes.html,22674,The Complete Calvin and Hobbes
284,4.58,15195,1406321346,good_reads:book,https://www.goodreads.com/author/show/150038.C...,2013.0,/genres/fantasy|/genres/young-adult|/genres/fa...,dir03/18335634-clockwork-princess.html,130161,"Clockwork Princess (The Infernal Devices, #3)"
304,4.54,572,0140259449,good_reads:book,https://www.goodreads.com/author/show/1265.Jan...,1933.0,/genres/classics|/genres/fiction|/genres/roman...,dir04/14905.The_Complete_Novels.html,17539,The Complete Novels
386,4.55,8820,0756404738,good_reads:book,https://www.goodreads.com/author/show/108424.P...,2011.0,/genres/fantasy|/genres/fantasy|/genres/epic-f...,dir04/1215032.The_Wise_Man_s_Fear.html,142499,"The Wise Man's Fear (The Kingkiller Chronicle,..."
400,4.53,9292,1423140605,good_reads:book,https://www.goodreads.com/author/show/15872.Ri...,2012.0,/genres/fantasy|/genres/young-adult|/genres/fa...,dir05/12127750-the-mark-of-athena.html,128412,"The Mark of Athena (The Heroes of Olympus, #3)"
475,4.57,824,1416997857,good_reads:book,https://www.goodreads.com/author/show/150038.C...,2009.0,/genres/fantasy|/genres/young-adult|/genres/fa...,dir05/6485421-the-mortal-instruments-boxed-set...,39720,The Mortal Instruments Boxed Set (The Mortal I...


In [20]:
df[df.year < 0]

Unnamed: 0,rating,review_count,isbn,booktype,author_url,year,genre_url,dir,rating_count,name
47,3.68,5785,0143039954,good_reads:book,https://www.goodreads.com/author/show/903.Homer,-800.0,/genres/classics|/genres/fiction|/genres/poetr...,dir01/1381.The_Odyssey.html,560248,The Odyssey
246,4.01,365,0147712556,good_reads:book,https://www.goodreads.com/author/show/903.Homer,-800.0,/genres/classics|/genres/fantasy|/genres/mytho...,dir03/1375.The_Iliad_The_Odyssey.html,35123,The Iliad/The Odyssey
455,3.85,1499,0140449140,good_reads:book,https://www.goodreads.com/author/show/879.Plato,-380.0,/genres/philosophy|/genres/classics|/genres/no...,dir05/30289.The_Republic.html,82022,The Republic
596,3.77,1240,0679729526,good_reads:book,https://www.goodreads.com/author/show/919.Virgil,-29.0,/genres/classics|/genres/poetry|/genres/fictio...,dir06/12914.The_Aeneid.html,60308,The Aeneid
629,3.64,1231,1580495931,good_reads:book,https://www.goodreads.com/author/show/1002.Sop...,-429.0,/genres/classics|/genres/plays|/genres/drama|/...,dir07/1554.Oedipus_Rex.html,93192,Oedipus Rex
674,3.92,3559,1590302257,good_reads:book,https://www.goodreads.com/author/show/1771.Sun...,-512.0,/genres/non-fiction|/genres/politics|/genres/c...,dir07/10534.The_Art_of_War.html,114619,The Art of War
746,4.06,1087,0140449183,good_reads:book,https://www.goodreads.com/author/show/5158478....,-500.0,/genres/classics|/genres/spirituality|/genres/...,dir08/99944.The_Bhagavad_Gita.html,31634,The Bhagavad Gita
777,3.52,1038,1580493882,good_reads:book,https://www.goodreads.com/author/show/1002.Sop...,-442.0,/genres/drama|/genres/fiction|/genres/classics...,dir08/7728.Antigone.html,49084,Antigone
1233,3.94,704,015602764X,good_reads:book,https://www.goodreads.com/author/show/1002.Sop...,-400.0,/genres/classics|/genres/plays|/genres/drama|/...,dir13/1540.The_Oedipus_Cycle.html,36008,The Oedipus Cycle
1397,4.03,890,0192840509,good_reads:book,https://www.goodreads.com/author/show/12452.Aesop,-560.0,/genres/classics|/genres/childrens|/genres/lit...,dir14/21348.Aesop_s_Fables.html,71259,Aesop's Fables


It looks confusing at first, but it actually means the year was BC.

In [21]:
df[(df.year < 0) & (df.rating > 4)]

Unnamed: 0,rating,review_count,isbn,booktype,author_url,year,genre_url,dir,rating_count,name
246,4.01,365,147712556,good_reads:book,https://www.goodreads.com/author/show/903.Homer,-800.0,/genres/classics|/genres/fantasy|/genres/mytho...,dir03/1375.The_Iliad_The_Odyssey.html,35123,The Iliad/The Odyssey
746,4.06,1087,140449183,good_reads:book,https://www.goodreads.com/author/show/5158478....,-500.0,/genres/classics|/genres/spirituality|/genres/...,dir08/99944.The_Bhagavad_Gita.html,31634,The Bhagavad Gita
1397,4.03,890,192840509,good_reads:book,https://www.goodreads.com/author/show/12452.Aesop,-560.0,/genres/classics|/genres/childrens|/genres/lit...,dir14/21348.Aesop_s_Fables.html,71259,Aesop's Fables
1882,4.02,377,872205541,good_reads:book,https://www.goodreads.com/author/show/879.Plato,-400.0,/genres/philosophy|/genres/classics|/genres/no...,dir19/22632.The_Trial_and_Death_of_Socrates.html,18712,The Trial and Death of Socrates
3133,4.3,131,872203492,good_reads:book,https://www.goodreads.com/author/show/879.Plato,-400.0,/genres/philosophy|/genres/classics|/genres/no...,dir32/9462.Complete_Works.html,7454,Complete Works
4475,4.11,281,865163480,good_reads:book,https://www.goodreads.com/author/show/879.Plato,-390.0,/genres/philosophy|/genres/classics|/genres/no...,dir45/73945.Apology.html,11478,Apology
5367,4.07,133,872206335,good_reads:book,https://www.goodreads.com/author/show/879.Plato,-360.0,/genres/philosophy|/genres/classics|/genres/no...,dir54/30292.Five_Dialogues.html,9964,Five Dialogues


The code above filters using 2 test conditions. Wrap each condition in its own parentheses, join them with `&`, and put the whole expression inside the brackets.
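
Here is a small sketch of combining conditions, on a made-up table. The `|` (or) and `~` (not) operators and the `query` version are extras I am adding next to the `&` form used above.

```python
import pandas as pd

books = pd.DataFrame({
    "name": ["A", "B", "C", "D"],       # made-up titles
    "year": [-800, -380, 1960, -500],
    "rating": [4.01, 3.85, 4.23, 4.06],
})

# The parentheses are required because & binds more tightly than < and >.
old_and_good = books[(books.year < 0) & (books.rating > 4)]

# | means "or", ~ means "not".
old_or_good = books[(books.year < 0) | (books.rating > 4)]
not_old = books[~(books.year < 0)]

# query() expresses the same filter as a string.
same_filter = books.query("year < 0 and rating > 4")

print(old_and_good)
```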

## Cleaning

Remember the datatypes? The `dtypes` attribute?

In [22]:
df.dtypes

rating          float64
review_count     object
isbn             object
booktype         object
author_url       object
year            float64
genre_url        object
dir              object
rating_count     object
name             object
dtype: object

In [23]:
df['rating_count'] = df.rating_count.astype(int) # df['rating_count'] on the left is the same column as df.rating_count; brackets also allow names with special characters
df['review_count'] = df.review_count.astype(int)
df['year'] = df.year.astype(int)

ValueError: invalid literal for int() with base 10: 'None'

The command above gives you an error. It tried to convert the values in the `rating_count` column to `int`, but that does not work. Why?

> ValueError: invalid literal for int() with base 10: 'None'

This is the error message.
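
The error can be reproduced on a tiny made-up column that contains the string `'None'`. This is only a sketch of the likely cause; it also shows that `isnull()` does not flag such a string, because a string is not a missing value.

```python
import pandas as pd

counts = pd.Series([2958974, 1284478, "None"])   # made-up column with a stray string

try:
    counts.astype(int)
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)

# isnull() only flags real missing values (NaN / None), not the text "None".
flagged = counts.isnull().tolist()
print(flagged)
```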

We can check if there are `null` values in the rating_count column.

In [24]:
df[df.rating_count.isnull()]

Unnamed: 0,rating,review_count,isbn,booktype,author_url,year,genre_url,dir,rating_count,name


Turns out, there are no null values in rating count. We can check the next one.

In [25]:
df[df.review_count.isnull()]

Unnamed: 0,rating,review_count,isbn,booktype,author_url,year,genre_url,dir,rating_count,name


In [26]:
df[df.year.isnull()]

Unnamed: 0,rating,review_count,isbn,booktype,author_url,year,genre_url,dir,rating_count,name
2442,4.23,526.0,,good_reads:book,https://www.goodreads.com/author/show/623606.A...,,/genres/religion|/genres/islam|/genres/non-fic...,dir25/1301625.La_Tahzan.html,4134.0,La Tahzan
2869,4.61,2.0,,good_reads:book,https://www.goodreads.com/author/show/8182217....,,,dir29/22031070-my-death-experiences---a-preach...,23.0,My Death Experiences - A Preacherâs 18 Apoca...
3643,,,,,,,,dir37/9658936-harry-potter.html,,
5282,,,,,,,,dir53/113138.The_Winner.html,,
5572,3.71,35.0,8423336603.0,good_reads:book,https://www.goodreads.com/author/show/285658.E...,,/genres/fiction,dir56/890680._rase_una_vez_el_amor_pero_tuve_q...,403.0,Ãrase una vez el amor pero tuve que matarlo. ...
5658,4.32,44.0,,good_reads:book,https://www.goodreads.com/author/show/25307.Ro...,,/genres/fantasy|/genres/fantasy|/genres/epic-f...,dir57/5533041-assassin-s-apprentice-royal-assa...,3850.0,Assassin's Apprentice / Royal Assassin (Farsee...
5683,4.56,204.0,,good_reads:book,https://www.goodreads.com/author/show/3097905....,,/genres/fantasy|/genres/young-adult|/genres/ro...,dir57/12474623-tiger-s-dream.html,895.0,"Tiger's Dream (The Tiger Saga, #5)"


Look at the table and you can find missing values (the empty fields), which pandas stores as NaN. NaN stands for Not a Number. There are also `None` string values in the rating_count column. You could replace these values, but you can also get rid of those rows using the code below.

In [27]:
df = df[df.year.notnull()]
df.shape

(5993, 10)

How does the code above work?

The code is `df = df[df.year.notnull()]`.

The expression `df[df.year.notnull()]` keeps only the rows whose `year` is **not null**.

Assigning the result back to `df` means that `df` now refers to a new, filtered dataframe that is different from the original one, so it no longer contains rows with a missing year.

I then checked the shape of the dataframe using `df.shape` to see the number of rows and columns still left in the variable.
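
The whole cleaning step can be sketched on a made-up three-row table. The `.copy()` call is my addition and not part of the notebook code; it is there on the assumption that you want the filtered table to be independent of the original before assigning new columns to it.

```python
import numpy as np
import pandas as pd

books = pd.DataFrame({
    "name": ["A", "B", "C"],                          # made-up rows
    "year": [2008.0, np.nan, 1960.0],
    "rating_count": ["2958974", "None", "2078123"],
})

# Keep only the rows whose year is present.
cleaned = books[books.year.notnull()].copy()

# With the bad row gone, the conversions to int succeed.
cleaned["year"] = cleaned.year.astype(int)
cleaned["rating_count"] = cleaned.rating_count.astype(int)

print(cleaned.shape)
print(cleaned.dtypes)
```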

In [30]:
df['rating_count'] = df.rating_count.astype(int)
df['review_count'] = df.review_count.astype(int)
df['year'] = df.year.astype(int)

In [31]:
df.dtypes

rating          float64
review_count      int32
isbn             object
booktype         object
author_url       object
year              int32
genre_url        object
dir              object
rating_count      int32
name             object
dtype: object