# Exploring Pandas With SQL Commands

### Pandas

Pandas is a widely used open-source Python library for data manipulation and analysis. It can read data from a CSV file, an Excel sheet, or JSON and load it into a DataFrame, its core tabular data object.
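As a minimal sketch of that loading step, the snippet below builds DataFrames from CSV and JSON text held in memory. The sample rows and the names `csv_text`, `json_text`, `from_csv`, and `from_json` are made up for illustration and are not taken from the course data.

```python
import io
import pandas as pd

# Hypothetical sample data, not taken from movies.csv.
csv_text = "Title,Year,Rating\nFilm A,1999,7.5\nFilm B,2005,8.1\n"
json_text = '[{"Title": "Film A", "Year": 1999}, {"Title": "Film B", "Year": 2005}]'

# read_csv and read_json accept file paths as well as file-like objects.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv.shape)           # (number of rows, number of columns)
print(list(from_json.columns))  # column names inferred from the JSON keys
```

Passing a file name such as `'movies.csv'` instead of the `StringIO` object reads the same kind of data from disk.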

### SQL in Pandas

***pandasql*** allows you to query pandas DataFrames using SQL syntax. It provides a more familiar way of manipulating and cleaning data for people who are new to Python or pandas but more familiar with SQL. Note that ***pandasql*** does not talk to a database server such as PostgreSQL: it copies the DataFrames named in a query into a temporary local SQLite database, runs the query there, and returns the result as a new DataFrame. Therefore, there is no need to connect to a database before using ***pandasql***.


### Setup

In addition to the following setup and this notebook, you'll need the `movies.csv` file from Moodle. Add it to Jupyter in the same directory as this notebook.

#### On a manual Python/Jupyter installation (without Docker)

To install the Pandas library and ***pandasql***, open a terminal and run the following commands:

pip3 install pandas

pip3 install -U pandasql

#### On Jupyter with Docker

Go to the Jupyter home, and select `New` then `Terminal` in the top right corner of the file listing view.
Then in the terminal that appears run the following commands:

pip3 install pandas

pip3 install -U pandasql

You can ignore the warning about running as 'root'; this is normal in the Docker environment.

### Getting started

Once you have successfully installed the Pandas library and ***pandasql***, you can load them with the following commands. Notice that you don't need to connect to a database here.

In [1]:
import pandas as pd
from pandasql import sqldf

Together with this notebook, we also provide a CSV file that contains information about movies, e.g., title, year, genre, votes and rating. A CSV file can be imported into a pandas DataFrame with the following command, which creates a DataFrame called movie.

In [2]:
movie = pd.read_csv('movies.csv')

Next, using a single line of code, we can set up ***pandasql*** so that we can use SQL commands to query the movie DataFrame.

In [3]:
pysqldf = lambda q: sqldf(q, globals())
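Passing `globals()` lets `sqldf` look up DataFrames by the variable names used in the query. The sketch below imitates that idea with the standard `sqlite3` module. The helper `run_sql` and the `films` data are hypothetical, and the code is a simplified stand-in under that assumption, not the actual ***pandasql*** implementation.

```python
import sqlite3
import pandas as pd

def run_sql(query, env):
    """Hypothetical stand-in for sqldf: copy every DataFrame found in env
    into an in-memory SQLite database, then run the query there."""
    conn = sqlite3.connect(":memory:")
    try:
        for name, value in env.items():
            if isinstance(value, pd.DataFrame):
                # The dictionary key becomes the SQL table name.
                value.to_sql(name, conn, index=False)
        return pd.read_sql_query(query, conn)
    finally:
        conn.close()

films = pd.DataFrame({"Title": ["Film A", "Film B"], "Year": [1999, 2005]})
env = {"films": films}  # plays the role of globals() in the notebook
recent = run_sql("SELECT Title FROM films WHERE Year > 2000", env)
print(recent)
```

Because the table name comes from the dictionary key, the query has to refer to the DataFrame by its Python variable name, which is why the queries in this notebook say `FROM movie`.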

To perform data analysis operations with ***pandasql***, you only need to pass the SQL queries that we learned in the lecture to the function pysqldf("SQL Command"). The following example code prints the first 15 tuples in the DataFrame movie.

In [4]:
result = pysqldf("SELECT * FROM movie LIMIT 15")
print(result)

                     Genres                                 Title  Year  \
0                  Thriller                       The Manipulator  1971   
1   Action,Adventure,Comedy     The Return of the Ancient Mariner  1968   
2        Comedy,Crime,Drama                          Last Request  1957   
3    Animation,Family,Short                                Doremi  1986   
4                    Comedy               I piaceri dello scapolo  1960   
5                 Adventure                     The Karachi Story  1954   
6      Drama,Family,Romance  He Loves Me, He Loves Me Not: Part 2  1980   
7                     Drama                     Le colonel Durand  1948   
8    Adventure,Comedy,Drama              Crescendo/Three Feathers  1980   
9      Action,Drama,Romance                              Skybound  1935   
10   Animation,Comedy,Short               Hill-billing and Cooing  1956   
11            Drama,Romance                 The Wrath of the Gods  1914   
12           Comedy,Music

Now there is a new action movie called Tenet and we want to add it to this DataFrame. It has 4000 votes and a rating of 9.1. Try to insert this record into the movie DataFrame using the following ***pandasql*** command:

In [5]:
movie = pysqldf("INSERT INTO movie VALUES ( 'Action', 'Tenet', 2020, 4000, 9.1 )")

PandaSQLException: (sqlite3.OperationalError) no such table: movie
[SQL: INSERT INTO movie VALUES ( 'Action', 'Tenet', 2020, 4000, 9.1 )]
(Background on this error at: https://sqlalche.me/e/14/e3q8)

The INSERT command raises an error. This is because ***pandasql*** currently doesn't support UPDATE, DELETE or INSERT queries. Only filtering, aggregation and join queries are supported.

Therefore, to add a new row to the pandas DataFrame, you should use the native pandas API. `DataFrame.append` has been deprecated since pandas 1.4 and was removed in pandas 2.0, so the cell below uses `pd.concat`. For more information, see: https://pandas.pydata.org/docs/reference/api/pandas.concat.html

In [6]:
new_tuple = {'Genres':'Action', 'Title':'Tenet', 'Year':2020, 'Votes':4000, 'Rating':9.1}
# Wrap the dict in a one-row DataFrame and concatenate it onto movie
movie = pd.concat([movie, pd.DataFrame([new_tuple])], ignore_index=True)
print(pysqldf("SELECT * FROM movie WHERE Title = 'Tenet'"))


   Genres  Title  Year  Votes  Rating
0  Action  Tenet  2020   4000     9.1
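The other unsupported statements, UPDATE and DELETE, can be expressed with native pandas operations in a similar way. The following is a small sketch on a made-up `films` DataFrame (hypothetical rows that reuse the column names of the movie DataFrame), with the SQL each step corresponds to shown in the comments.

```python
import pandas as pd

# Hypothetical rows using the same column names as the movie DataFrame.
films = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film C"],
    "Year": [1999, 2005, 2012],
    "Votes": [50, 300, 1200],
    "Rating": [6.0, 7.2, 8.4],
})

# UPDATE films SET Rating = 7.5 WHERE Title = 'Film B'
films.loc[films["Title"] == "Film B", "Rating"] = 7.5

# DELETE FROM films WHERE Votes < 100
# (keep only the rows that do NOT match the delete condition)
films = films[films["Votes"] >= 100].reset_index(drop=True)

print(films)
```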


Now write a SQL query with ***pandasql*** that lists the 15 highest-rated movies with more than 100 votes released after the year 2000. What is the rank of the movie Tenet that we just added?

In [11]:
result = pysqldf("SELECT TITLE FROM movie WHERE VOTES > 100 AND YEAR > 2000 ORDER BY RATING DESC LIMIT 15")
print(result)

                              Title
0                         Sailcloth
1              David Bowie: Lazarus
2                The Saint's Supper
3                             Tenet
4             The Gates of Judgment
5      Jungles: People of the Trees
6                        The Debate
7   Arctic: Life in the Deep Freeze
8       Mountains: Life in Thin Air
9              Blood-Stained Reward
10                  Civilization IV
11            Oceans: Into the Blue
12                  Time Commanders
13   Grasslands: The Roots of Power
14           Rivers: Friend and Foe
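For comparison, the same filter, sort, and limit steps can be written directly in pandas. This is a sketch on a made-up `films` DataFrame (the rows and the name `Film D` are hypothetical). It also computes a 1-based rank, since the index printed next to each title starts at 0.

```python
import pandas as pd

# Hypothetical rows using the same column names as the movie DataFrame.
films = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film C", "Film D"],
    "Year": [1999, 2005, 2012, 2020],
    "Votes": [500, 300, 80, 4000],
    "Rating": [9.5, 7.2, 9.9, 9.1],
})

# WHERE Votes > 100 AND Year > 2000 ORDER BY Rating DESC LIMIT 15
top = (
    films[(films["Votes"] > 100) & (films["Year"] > 2000)]
    .sort_values("Rating", ascending=False)
    .head(15)
    .reset_index(drop=True)
)

# Position of a title in the sorted result, counted from 1.
rank = top.index[top["Title"] == "Film D"][0] + 1
print(top["Title"].tolist(), rank)
```

If several movies share the same rating, their relative order is not determined by the rating alone, so a second sort key (for example the number of votes) would be needed for a stable ranking.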


Write a SQL query with ***pandasql*** that counts the number of movies in this dataset released after the year 2000.

In [14]:
result = pysqldf("SELECT COUNT(TITLE) FROM movie WHERE YEAR > 2000")
print(result)

   COUNT(TITLE)
0          2010
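The same count can be cross-checked in pandas, and a `groupby` gives the per-year breakdown that a GROUP BY clause would produce. The `films` rows below are hypothetical and only illustrate the pattern.

```python
import pandas as pd

# Hypothetical rows, not taken from movies.csv.
films = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film C", "Film D"],
    "Year": [1999, 2005, 2005, 2020],
})

recent = films[films["Year"] > 2000]

# SELECT COUNT(Title) FROM films WHERE Year > 2000
# (like COUNT(Title), Series.count() skips missing titles)
n_recent = int(recent["Title"].count())

# SELECT Year, COUNT(Title) FROM films WHERE Year > 2000 GROUP BY Year
per_year = recent.groupby("Year")["Title"].count()

print(n_recent)
print(per_year.to_dict())
```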
