# Data Science Laboratory: Pandas and Python 
#### By: Javier Orduz
[license-badge]: https://img.shields.io/badge/License-CC-orange
[license]: https://creativecommons.org/licenses/by-nc-sa/3.0/deed.en

[![CC License][license-badge]][license]  [![DS](https://img.shields.io/badge/downloads-DS-green)](https://github.com/Earlham-College/DS_Fall_2022)  [![Github](https://img.shields.io/badge/jaorduz-repos-blue)](https://github.com/jaorduz/)  ![Follow @jaorduc](https://img.shields.io/twitter/follow/jaorduc?label=follow&logo=twitter&logoColor=lkj&style=plastic)


We load the different packages that we will use.

In [35]:
import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
%matplotlib inline

<h1>Table of contents</h1>

<div class="alert  alert-block alert-info" style="margin-top: 20px">
    <ol>
        <li><a href="#unData">Python</a></li>
         <ol>
             <li><a href="#reData">Reading</a></li>
             <li><a href="#exData">Exploration</a></li>
         </ol>
        <li><a href="#daExploration">Pandas</a></li>
        <li><a href="#simRegression">Experiments</a></li>
        <li><a href="#simRegression">Conclusions</a></li>
    </ol>
</div>
<br>
<hr>


<h2 id="unData">Data</h2>

### `books.csv`:

This dataset contains Goodreads book records, including ratings, review counts, 
publication years, and genre links.

Some **features** are

- **rating** e.g. 4.40
- **isbn** e.g. 0439023483
- **year** e.g. 2008
- **name** e.g. The Hunger Games (The Hunger Games, #1)

In [48]:
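# header=None: no row of the file is used as a header, so we supply the column names.
df = pd.read_csv(
    "../data/books.csv",
    header=None,
    names=["rating", "review_count", "isbn", "booktype", "author_url",
           "year", "genre_urls", "dir", "rating_count", "name"],
)
df.head()  # show the first five rows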

Unnamed: 0,rating,review_count,isbn,booktype,author_url,year,genre_urls,dir,rating_count,name
0,4.4,136455,439023483,good_reads:book,https://www.goodreads.com/author/show/153394.S...,2008.0,/genres/young-adult|/genres/science-fiction|/g...,dir01/2767052-the-hunger-games.html,2958974,"The Hunger Games (The Hunger Games, #1)"
1,4.41,16648,439358078,good_reads:book,https://www.goodreads.com/author/show/1077326....,2003.0,/genres/fantasy|/genres/young-adult|/genres/fi...,dir01/2.Harry_Potter_and_the_Order_of_the_Phoe...,1284478,Harry Potter and the Order of the Phoenix (Har...
2,3.56,85746,316015849,good_reads:book,https://www.goodreads.com/author/show/941441.S...,2005.0,/genres/young-adult|/genres/fantasy|/genres/ro...,dir01/41865.Twilight.html,2579564,"Twilight (Twilight, #1)"
3,4.23,47906,61120081,good_reads:book,https://www.goodreads.com/author/show/1825.Har...,1960.0,/genres/classics|/genres/fiction|/genres/histo...,dir01/2657.To_Kill_a_Mockingbird.html,2078123,To Kill a Mockingbird
4,4.23,34772,679783261,good_reads:book,https://www.goodreads.com/author/show/1265.Jan...,1813.0,/genres/classics|/genres/fiction|/genres/roman...,dir01/1885.Pride_and_Prejudice.html,1388992,Pride and Prejudice


We inspect the type of each column.

In [49]:
df.dtypes

rating          float64
review_count     object
isbn             object
booktype         object
author_url       object
year            float64
genre_urls       object
dir              object
rating_count     object
name             object
dtype: object
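
The `object` dtype on count columns such as `review_count` suggests that at least one entry in them was not read as a number. As a minimal sketch, assuming a small hypothetical frame `toy` (not the real `books.csv`), `pd.to_numeric` with `errors="coerce"` could convert such a column and mark unparseable entries as missing:

```python
import pandas as pd

# Hypothetical toy data: the last entry is text, so the column is not numeric.
toy = pd.DataFrame({"review_count": ["136455", "16648", "None"]})

# errors="coerce" turns anything that cannot be parsed into NaN
# instead of raising an exception.
toy["review_count"] = pd.to_numeric(toy["review_count"], errors="coerce")

# Count how many entries could not be converted.
n_missing = int(toy["review_count"].isnull().sum())
print(toy.dtypes)
print("entries that could not be parsed:", n_missing)
```

Whether coercing is appropriate depends on why the entries are non-numeric, so it is worth inspecting them before converting the real data.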

We find the number of rows and columns.

In [51]:
df.shape

(6000, 10)

###  Some experiments with different subjects

Filtering the data set.

One way to get a filtered dataframe is the ```query``` method:

In [52]:
df.query("rating > 4.95")

Unnamed: 0,rating,review_count,isbn,booktype,author_url,year,genre_urls,dir,rating_count,name
1718,5.0,28,,good_reads:book,https://www.goodreads.com/author/show/6467808....,2014.0,/genres/poetry|/genres/childrens,dir18/22204746-an-elephant-is-on-my-house.html,64,An Elephant Is On My House
2145,5.0,3,1300589469.0,good_reads:book,https://www.goodreads.com/author/show/6906561....,2012.0,,dir22/17287259-a-book-about-absolutely-nothing...,63,A Book About Absolutely Nothing.
2903,5.0,0,983002282.0,good_reads:book,https://www.goodreads.com/author/show/6589034....,2012.0,,dir30/17608096-obscured-darkness.html,8,Obscured Darkness (Family Secrets #2)
2909,5.0,0,983002215.0,good_reads:book,https://www.goodreads.com/author/show/6589034....,2011.0,,dir30/16200303-family-secrets.html,9,Family Secrets
4473,5.0,0,,good_reads:book,https://www.goodreads.com/author/show/6896621....,2012.0,,dir45/17259227-patience-s-love.html,7,Patience's Love
5564,5.0,9,,good_reads:book,https://www.goodreads.com/author/show/7738947....,2014.0,/genres/romance|/genres/new-adult,dir56/21902777-untainted.html,14,"Untainted (Photographer Trilogy, #3)"
5692,5.0,0,,good_reads:book,https://www.goodreads.com/author/show/5989528....,2012.0,,dir57/14288412-abstraction-in-theory---laws-of...,6,Abstraction In Theory - Laws Of Physical Trans...


Alternatively, we create a boolean mask from ```df.year``` and use it to index into the dataframe, which returns only the rows where the mask is true.

In [40]:
df[df.year >= 2014]

Unnamed: 0,rating,review_count,isbn,booktype,author_url,year,genre_urls,dir,rating_count,name
911,4.85,26,1491732954,good_reads:book,https://www.goodreads.com/author/show/8189303....,2014.0,/genres/fiction,dir10/22242097-honor-and-polygamy.html,97,Honor and Polygamy
925,4.21,9323,0007466714,good_reads:book,https://www.goodreads.com/author/show/2987125....,2014.0,/genres/young-adult|/genres/science-fiction|/g...,dir10/20572939-the-one.html,64518,"The One (The Selection, #3)"
938,4.51,11011,1481426303,good_reads:book,https://www.goodreads.com/author/show/150038.C...,2014.0,/genres/fantasy|/genres/young-adult|/genres/fa...,dir10/8755785-city-of-heavenly-fire.html,69924,"City of Heavenly Fire (The Mortal Instruments,..."
1115,4.48,5648,,good_reads:book,https://www.goodreads.com/author/show/4637539....,2014.0,/genres/science-fiction|/genres/dystopia|/genr...,dir12/13188676-ignite-me.html,30166,"Ignite Me (Shatter Me, #3)"
1300,4.61,24,1499227299,good_reads:book,https://www.goodreads.com/author/show/7414345....,2014.0,/genres/paranormal|/genres/vampires|/genres/pa...,dir14/22090082-vampire-princess-rising.html,128,Vampire Princess Rising (The Winters Family Sa...
...,...,...,...,...,...,...,...,...,...,...
5882,4.35,1139,,good_reads:book,https://www.goodreads.com/author/show/7056140....,2014.0,/genres/romance|/genres/romance|/genres/contem...,dir59/18138755-forever-with-you.html,13472,"Forever with You (Fixed, #3)"
5883,4.30,1049,0525426361,good_reads:book,https://www.goodreads.com/author/show/7014881....,2014.0,/genres/non-fiction|/genres/autobiography|/gen...,dir59/17675031-this-star-won-t-go-out.html,6954,This Star Won't Go Out
5884,3.83,2046,0670016780,good_reads:book,https://www.goodreads.com/author/show/7314532....,2014.0,/genres/fantasy|/genres/young-adult|/genres/fa...,dir59/18079804-half-bad.html,9201,"Half Bad (Half Life, #1)"
5946,4.22,1159,0373211120,good_reads:book,https://www.goodreads.com/author/show/2995873....,2014.0,/genres/paranormal|/genres/vampires|/genres/yo...,dir60/17883441-the-forever-song.html,5953,"The Forever Song (Blood of Eden, #3)"
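
The two filtering styles used so far can be compared side by side. The sketch below assumes a hypothetical four-row frame `toy` in place of the real data. It shows that the comparison first produces a boolean Series and that indexing with it selects the same rows as the equivalent `query` string:

```python
import pandas as pd

# Hypothetical toy data standing in for a few rows of the books table.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "year": [2008.0, 2014.0, 1960.0, 2014.0],
    "rating": [4.40, 5.00, 4.23, 4.85],
})

# Step 1: the comparison yields a boolean Series (the mask), one entry per row.
mask = toy.year >= 2014

# Step 2: indexing with the mask keeps only the rows where it is True.
by_mask = toy[mask]

# The same filter written as a query string.
by_query = toy.query("year >= 2014")

# Both approaches return identical dataframes, original row labels included.
same = by_mask.equals(by_query)
print(mask.tolist())
print(by_mask)
```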


We can combine conditions with ```&```, wrapping each comparison in parentheses.

In [53]:
df[(df.year >= 2014.0) & (df.rating > 4.95)]

Unnamed: 0,rating,review_count,isbn,booktype,author_url,year,genre_urls,dir,rating_count,name
1718,5.0,28,,good_reads:book,https://www.goodreads.com/author/show/6467808....,2014.0,/genres/poetry|/genres/childrens,dir18/22204746-an-elephant-is-on-my-house.html,64,An Elephant Is On My House
5564,5.0,9,,good_reads:book,https://www.goodreads.com/author/show/7738947....,2014.0,/genres/romance|/genres/new-adult,dir56/21902777-untainted.html,14,"Untainted (Photographer Trilogy, #3)"
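
The parentheses around each condition matter because Python's `&` binds more tightly than comparison operators such as `>=`. A short sketch on a hypothetical frame `toy` (an assumption, not the real data) shows the element-wise operators `&` (and), `|` (or), and `~` (not) applied to named masks, which avoids the precedence issue altogether:

```python
import pandas as pd

# Hypothetical toy data with the two columns used in the filter.
toy = pd.DataFrame({
    "year": [2008.0, 2014.0, 1960.0, 2014.0],
    "rating": [4.40, 5.00, 4.23, 4.85],
})

# Naming each mask keeps the combined expression readable.
recent = toy.year >= 2014
top = toy.rating > 4.95

both = toy[recent & top]      # rows satisfying both conditions
either = toy[recent | top]    # rows satisfying at least one condition
not_recent = toy[~recent]     # rows where the first condition is False

print(both)
print(either)
print(not_recent)
```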


We want to change the ```type``` of a column, so we first recall the current types.

In [54]:
df.dtypes

rating          float64
review_count     object
isbn             object
booktype         object
author_url       object
year            float64
genre_urls       object
dir              object
rating_count     object
name             object
dtype: object

Now we want to change the data type of the year; that is, we want the year attribute to be ```int```.

In [43]:
df['year']=df.year.astype(int)

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

But we hit a problem: the cast fails because some rows in the data set have no value for the year. To check, we select the rows where the year attribute is null.

In [44]:
df[df.year.isnull()]

Unnamed: 0,rating,review_count,isbn,booktype,author_url,year,genre_urls,dir,rating_count,name
2442,4.23,526.0,,good_reads:book,https://www.goodreads.com/author/show/623606.A...,,/genres/religion|/genres/islam|/genres/non-fic...,dir25/1301625.La_Tahzan.html,4134.0,La Tahzan
2869,4.61,2.0,,good_reads:book,https://www.goodreads.com/author/show/8182217....,,,dir29/22031070-my-death-experiences---a-preach...,23.0,My Death Experiences - A Preacher’s 18 Apoca...
3643,,,,,,,,dir37/9658936-harry-potter.html,,
5282,,,,,,,,dir53/113138.The_Winner.html,,
5572,3.71,35.0,8423336603.0,good_reads:book,https://www.goodreads.com/author/show/285658.E...,,/genres/fiction,dir56/890680._rase_una_vez_el_amor_pero_tuve_q...,403.0,Érase una vez el amor pero tuve que matarlo. ...
5658,4.32,44.0,,good_reads:book,https://www.goodreads.com/author/show/25307.Ro...,,/genres/fantasy|/genres/fantasy|/genres/epic-f...,dir57/5533041-assassin-s-apprentice-royal-assa...,3850.0,Assassin's Apprentice / Royal Assassin (Farsee...
5683,4.56,204.0,,good_reads:book,https://www.goodreads.com/author/show/3097905....,,/genres/fantasy|/genres/young-adult|/genres/ro...,dir57/12474623-tiger-s-dream.html,895.0,"Tiger's Dream (The Tiger Saga, #5)"
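
Before dropping rows, it can help to see how many values are missing in each column. The following is a minimal sketch on a hypothetical frame `toy` (the values are illustrative, not taken from the full data set) that counts nulls per column with `isnull().sum()`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in both columns.
toy = pd.DataFrame({
    "isbn": ["0439023483", None, None],
    "year": [2008.0, np.nan, 2012.0],
})

# isnull() marks each missing cell as True; summing counts them per column.
missing_per_column = toy.isnull().sum()

# The same idea restricted to a single column.
rows_without_year = int(toy["year"].isnull().sum())

print(missing_per_column)
print("rows without a year:", rows_without_year)
```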


In [45]:
df = df[df.year.notnull()].copy()  # keep rows with a year; copy avoids SettingWithCopyWarning on the next assignment
df.shape

(5993, 10)

In [46]:
df['year']=df.year.astype(int)

With the seven rows that lacked a year removed, the cast to ```int``` now succeeds.
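
The same clean-up can also be written with `dropna`. The sketch below uses a hypothetical three-row frame `toy` rather than the real data. It drops rows whose `year` is missing and then casts the column, using the dictionary form of `DataFrame.astype` so that no column of a filtered frame is assigned in place:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second row has no year.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "year": [2008.0, np.nan, 2012.0],
})

# dropna(subset=...) removes rows with a missing value in the listed columns
# and returns a new dataframe.
clean = toy.dropna(subset=["year"])

# astype with a dict converts only the named column and returns a new dataframe.
clean = clean.astype({"year": int})

print(clean.shape)
print(clean.dtypes)
```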

In [47]:
df.dtypes

rating          float64
review_count     object
isbn             object
booktype         object
author_url       object
year              int64
genre_urls       object
dir              object
rating_count     object
name             object
dtype: object
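
Dropping rows is one option; another is to keep them and use the nullable integer dtype `"Int64"` (capital I) in pandas, which stores missing entries as `<NA>`. The sketch below works on a hypothetical Series `years`. It first reproduces the failed cast and then applies the nullable alternative. This is a possible variation, not the approach taken in this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one year is missing.
years = pd.Series([2008.0, np.nan, 1813.0], name="year")

# A plain int cast fails because NaN has no integer representation.
try:
    years.astype(int)
    cast_failed = False
except ValueError:  # IntCastingNaNError is a subclass of ValueError
    cast_failed = True

# The nullable integer dtype keeps the row and marks the gap as <NA>.
nullable = years.astype("Int64")

print("plain int cast failed:", cast_failed)
print(nullable)
```

The trade-off is that later code must handle `<NA>` values, whereas the notebook's approach guarantees that every remaining row has a year.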

## Versions

In [None]:
from platform import python_version
print("python version: ", python_version())
!pip3 freeze | grep -E "pandas|numpy|matplotlib"

# References

[0] data https://tinyurl.com/2m3vr2xp

[1] numpy https://numpy.org/

[2] scipy https://docs.scipy.org/

[3] matplotlib https://matplotlib.org/

[4] matplotlib.cm https://matplotlib.org/stable/api/cm_api.html

[5] matplotlib.pyplot https://matplotlib.org/stable/api/pyplot_summary.html

[6] pandas https://pandas.pydata.org/docs/

[7] seaborn https://seaborn.pydata.org/
