# <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Pandas (7)</p>

<div class="alert alert-block alert-info alert">  
    
# <span style=" color:red">  Data Inputs and Outputs

## Table of Contents
**1. CSV Files**
* Find file paths
* CSV Input: read_csv()
* header = None
* index_col
* CSV Output (Save): to_csv
* index = True or False

**2. Excel Files**
* Excel File Input: read_excel()
* Excel File Output: to_excel
  
**3. HTML Tables**
* HTML Input: read_html()
* HTML Output: to_html()
  
**4. SQL Connections**
* Create a temporary SQLite database
* Write to database: to_sql
* Import data from a database: read_sql()
* Use SQL commands: read_sql_query()

# Data Inputs and Outputs
* To import and read datasets, we use the **pandas.read_** functions. To export (write) data, we use the **DataFrame.to_** methods.
* For details, see https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
* To read datasets, we need to know the exact directory location and correct file name.
* We may need passwords or permissions for certain data inputs (e.g. a SQL database password).

<table border="1" class="colwidths-given docutils">
<colgroup>
<col width="12%" />
<col width="40%" />
<col width="24%" />
<col width="24%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Format Type</th>
<th class="head">Data Description</th>
<th class="head">Reader</th>
<th class="head">Writer</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td>text</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/Comma-separated_values">CSV</a></td>
<td><a class="reference internal" href="#io-read-csv-table"><span class="std std-ref">read_csv</span></a></td>
<td><a class="reference internal" href="#io-store-in-csv"><span class="std std-ref">to_csv</span></a></td>
</tr>
<tr class="row-odd"><td>text</td>
<td><a class="reference external" href="https://www.json.org/">JSON</a></td>
<td><a class="reference internal" href="#io-json-reader"><span class="std std-ref">read_json</span></a></td>
<td><a class="reference internal" href="#io-json-writer"><span class="std std-ref">to_json</span></a></td>
</tr>
<tr class="row-even"><td>text</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/HTML">HTML</a></td>
<td><a class="reference internal" href="#io-read-html"><span class="std std-ref">read_html</span></a></td>
<td><a class="reference internal" href="#io-html"><span class="std std-ref">to_html</span></a></td>
</tr>
<tr class="row-odd"><td>text</td>
<td>Local clipboard</td>
<td><a class="reference internal" href="#io-clipboard"><span class="std std-ref">read_clipboard</span></a></td>
<td><a class="reference internal" href="#io-clipboard"><span class="std std-ref">to_clipboard</span></a></td>
</tr>
<tr class="row-even"><td>binary</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/Microsoft_Excel">MS Excel</a></td>
<td><a class="reference internal" href="#io-excel-reader"><span class="std std-ref">read_excel</span></a></td>
<td><a class="reference internal" href="#io-excel-writer"><span class="std std-ref">to_excel</span></a></td>
</tr>
<tr class="row-odd"><td>binary</td>
<td><a class="reference external" href="http://www.opendocumentformat.org">OpenDocument</a></td>
<td><a class="reference internal" href="#io-ods"><span class="std std-ref">read_excel</span></a></td>
<td>&#160;</td>
</tr>
<tr class="row-even"><td>binary</td>
<td><a class="reference external" href="https://support.hdfgroup.org/HDF5/whatishdf5.html">HDF5 Format</a></td>
<td><a class="reference internal" href="#io-hdf5"><span class="std std-ref">read_hdf</span></a></td>
<td><a class="reference internal" href="#io-hdf5"><span class="std std-ref">to_hdf</span></a></td>
</tr>
<tr class="row-odd"><td>binary</td>
<td><a class="reference external" href="https://github.com/wesm/feather">Feather Format</a></td>
<td><a class="reference internal" href="#io-feather"><span class="std std-ref">read_feather</span></a></td>
<td><a class="reference internal" href="#io-feather"><span class="std std-ref">to_feather</span></a></td>
</tr>
<tr class="row-even"><td>binary</td>
<td><a class="reference external" href="https://parquet.apache.org/">Parquet Format</a></td>
<td><a class="reference internal" href="#io-parquet"><span class="std std-ref">read_parquet</span></a></td>
<td><a class="reference internal" href="#io-parquet"><span class="std std-ref">to_parquet</span></a></td>
</tr>
<tr class="row-odd"><td>binary</td>
<td><a class="reference external" href="https://msgpack.org/index.html">Msgpack</a></td>
<td><a class="reference internal" href="#io-msgpack"><span class="std std-ref">read_msgpack</span></a></td>
<td><a class="reference internal" href="#io-msgpack"><span class="std std-ref">to_msgpack</span></a></td>
</tr>
<tr class="row-even"><td>binary</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/Stata">Stata</a></td>
<td><a class="reference internal" href="#io-stata-reader"><span class="std std-ref">read_stata</span></a></td>
<td><a class="reference internal" href="#io-stata-writer"><span class="std std-ref">to_stata</span></a></td>
</tr>
<tr class="row-odd"><td>binary</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/SAS_(software)">SAS</a></td>
<td><a class="reference internal" href="#io-sas-reader"><span class="std std-ref">read_sas</span></a></td>
<td>&#160;</td>
</tr>
<tr class="row-even"><td>binary</td>
<td><a class="reference external" href="https://docs.python.org/3/library/pickle.html">Python Pickle Format</a></td>
<td><a class="reference internal" href="#io-pickle"><span class="std std-ref">read_pickle</span></a></td>
<td><a class="reference internal" href="#io-pickle"><span class="std std-ref">to_pickle</span></a></td>
</tr>
<tr class="row-odd"><td>SQL</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/SQL">SQL</a></td>
<td><a class="reference internal" href="#io-sql"><span class="std std-ref">read_sql</span></a></td>
<td><a class="reference internal" href="#io-sql"><span class="std std-ref">to_sql</span></a></td>
</tr>
<tr class="row-even"><td>SQL</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/BigQuery">Google Big Query</a></td>
<td><a class="reference internal" href="#io-bigquery"><span class="std std-ref">read_gbq</span></a></td>
<td><a class="reference internal" href="#io-bigquery"><span class="std std-ref">to_gbq</span></a></td>
</tr>
</tbody>
</table>
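The reader/writer pairing in the table can be tried without any file on disk. The following is a minimal sketch using JSON; the small DataFrame and the variable names are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Illustrative data, not taken from any file used in this notebook
df = pd.DataFrame({"a": [0, 4], "b": [1, 5]})

# Writers are DataFrame methods; called without a path they return a string
json_text = df.to_json(orient="records")

# Readers are top-level pandas functions that return a new DataFrame
restored = pd.read_json(StringIO(json_text), orient="records")
print(restored)
```

The same pattern (a `to_*` method on the DataFrame, a `read_*` function on `pd`) applies to most formats listed above.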

In [1]:
import numpy as np
import pandas as pd

## 1. CSV Files

### Find file paths

In [2]:
# Find your working directory
# pwd  # a shell command, not Python; it may raise an error, so use os.getcwd() as below
import os
os.getcwd() 

# cwd: current working directory

'C:\\Users\\admin\\Desktop'

In [3]:
# List the contents of the current directory
# ls   # not Python either, so it may raise an error. Instead, use the IPython magic "%ls"
%ls

 Volume in drive C has no label.
 Volume Serial Number is 2AA1-FCAF

 Directory of C:\Users\admin\Desktop

05/16/2024  07:29 PM    <DIR>          .
05/16/2024  07:29 PM    <DIR>          ..
05/16/2024  07:14 PM    <DIR>          .ipynb_checkpoints
03/11/2024  11:30 AM             2,279 AnkiApp.lnk
03/01/2023  12:02 AM    <DIR>          Batch_84_github
04/18/2024  12:16 AM    <DIR>          Belgeler
04/17/2024  11:01 AM    <DIR>          Data Science
05/01/2024  04:15 PM    <DIR>          EGITIM
09/18/2023  04:42 PM                51 example.csv
05/15/2024  09:23 PM    <DIR>          Exploratory Data Analysis
05/14/2022  01:43 PM        49,638,112 Git-2.36.1-64-bit.exe
03/22/2023  07:18 PM    <DIR>          Grup_2
01/23/2024  12:36 PM    <DIR>          Internship
05/08/2024  07:19 PM    <DIR>          Interview
01/30/2023  10:57 PM    <DIR>          MuhammedSevim
09/18/2023  04:42 PM             5,022 my_excel_file.xlsx
05/16/2024  07:27 PM                40 newexample.csv
05/14/2024  1
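Since `%ls` only works inside IPython/Jupyter, a portable alternative is the standard-library `pathlib` module. This is a sketch, not part of the original notebook; the file name `example.csv` is reused from the listing above, and the script only checks whether it exists.

```python
from pathlib import Path

cwd = Path.cwd()  # same location that os.getcwd() reports

# Names of all CSV files in the working directory (may be an empty list)
csv_files = sorted(p.name for p in cwd.glob("*.csv"))
print(cwd)
print(csv_files)

# Build a path relative to the working directory and check that it exists
target = cwd / "example.csv"
print(target.exists())
```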

### CSV Input: read_csv()

In [4]:
df = pd.read_csv("example.csv")
df   # By default, the first line of the file is used as the header (a,b,c,d)

Unnamed: 0,a,b,c,d
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15


#### header = None

In [5]:
# If the first line should not be used as the header...
df = pd.read_csv("example.csv", header=None)
df
# This time, pandas assigns automatic column labels (0,1,2,3) and reads the first line (a,b,c,d) as a data row

Unnamed: 0,0,1,2,3
0,a,b,c,d
1,0,1,2,3
2,4,5,6,7
3,8,9,10,11
4,12,13,14,15


#### index_col=...

In [6]:
# Let's make a,b,c,d column names and assign "a" as an index.
# We can do it when we import data 
df = pd.read_csv("example.csv", index_col=0)  # since "a" is the first column, we used "0"
df

Unnamed: 0_level_0,b,c,d
a,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,1,2,3
4,5,6,7
8,9,10,11
12,13,14,15
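The three `read_csv` variants above can be compared side by side without a file on disk. This sketch assumes that `example.csv` contains the text shown in `csv_text`, which is reconstructed from the outputs above rather than copied from the file.

```python
from io import StringIO

import pandas as pd

# Assumed contents of example.csv, reconstructed from the displayed outputs
csv_text = "a,b,c,d\n0,1,2,3\n4,5,6,7\n8,9,10,11\n12,13,14,15\n"

default = pd.read_csv(StringIO(csv_text))                 # first line -> column names
no_header = pd.read_csv(StringIO(csv_text), header=None)  # first line -> data row 0
indexed = pd.read_csv(StringIO(csv_text), index_col="a")  # column "a" -> index

print(list(default.columns))       # ['a', 'b', 'c', 'd']
print(no_header.iloc[0].tolist())  # ['a', 'b', 'c', 'd']
print(indexed.index.tolist())      # [0, 4, 8, 12]
```

Note that `index_col` accepts either a position (`0`) or a column name (`"a"`).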


### CSV Output (Save): to_csv

* If you do not want to save the index, set **index=False**.
* Otherwise the index is written as the first column of the .csv file. If the index has no name, that column's header is left empty, and pandas labels it "Unnamed: 0" when the file is read back.
* To save the index, leave **index=True** (the default value).
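The "Unnamed: 0" behaviour is easiest to see with an index that has no name. The sketch below writes to strings instead of files; the DataFrame is illustrative only.

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({"b": [1, 5], "c": [2, 6]})  # default index, no name

with_index = df.to_csv()                # index=True is the default
without_index = df.to_csv(index=False)

cols_with = pd.read_csv(StringIO(with_index)).columns.tolist()
cols_without = pd.read_csv(StringIO(without_index)).columns.tolist()
print(cols_with)     # ['Unnamed: 0', 'b', 'c']
print(cols_without)  # ['b', 'c']

# Reading the saved index back as the index instead of as a data column
restored = pd.read_csv(StringIO(with_index), index_col=0)
print(restored.columns.tolist())  # ['b', 'c']
```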

In [7]:
# Name and save it in the same location
df.to_csv("newexample.csv")

# If you want to save it to a different location, use the full path
# df.to_csv("C:\\Users\\admin\\Desktop\\...newexample.csv")

#### index = True or False

In [8]:
# Save it with the index
df.to_csv("newexample.csv", index = True)

In [9]:
# Let's read it again: the index ("a") was saved as the first column
new = pd.read_csv("newexample.csv")
new

Unnamed: 0,a,b,c,d
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15


In [10]:
# Save it without an index
df.to_csv("newexample.csv", index = False)

In [11]:
# Let's read it again: the index ("a") was not saved, so only b, c, d remain
new1 = pd.read_csv("newexample.csv")
new1

Unnamed: 0,b,c,d
0,1,2,3
1,5,6,7
2,9,10,11
3,13,14,15


## 2. Excel Files

* Pandas can only read and write raw data. It cannot read macros, visualizations, or formulas created inside spreadsheets.
* Pandas treats an Excel Workbook as a dictionary, with the key being the sheet name and the value being the DataFrame representing the sheet itself.
* Using pandas with Excel requires additional libraries.
* Working with Excel Files in Python: https://www.python-excel.org/

In [12]:
#pip install openpyxl
#pip install xlrd

### Excel File Input: read_excel()

In [13]:
df = pd.read_excel("my_excel_file.xlsx") # make sure the file extension is ".xlsx"
df

Unnamed: 0,a,b,c,d
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15


In [14]:
# If you want to read a specific sheet, pass its name
df = pd.read_excel("my_excel_file.xlsx", sheet_name="First_Sheet")
df  # Since there is only one sheet, named "First_Sheet", the data did not change

Unnamed: 0,a,b,c,d
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15


##### What if you don't know the sheet name? Or want to run a for loop for certain sheet names? Or want every sheet?
There are several ways to do it: https://stackoverflow.com/questions/17977540/pandas-looking-up-the-list-of-sheets-in-an-excel-file

In [15]:
# If you do not know the sheet names, or there are many sheets, open the workbook with ExcelFile and use its "sheet_names" attribute
workbook = pd.ExcelFile("my_excel_file.xlsx")

In [16]:
workbook.sheet_names # there is only one sheet

['First_Sheet']

##### All sheets as a dictionary

In [17]:
# We can read every sheet with "sheet_name=None", which returns a dictionary.
# Then we can use the keys of this dictionary
excel_sheet_dict = pd.read_excel("my_excel_file.xlsx", sheet_name=None)

In [18]:
type(excel_sheet_dict) # it is a dictionary now

dict

In [19]:
# check out the keys to see the sheets
excel_sheet_dict.keys()

dict_keys(['First_Sheet'])

In [20]:
#  To see a sheet, write its name
excel_sheet_dict["First_Sheet"]

Unnamed: 0,a,b,c,d
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15


### Excel File Output: to_excel

In [21]:
# Let's assign the excel file above to a data frame and save it

excel_df = excel_sheet_dict["First_Sheet"]
excel_df.to_excel("newexcel.xlsx", sheet_name="First_Sheet", index=False) # we can save it with a sheet name

# If the file already exists, it will be overwritten
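To put several sheets into one workbook, `pd.ExcelWriter` can be used as a context manager, and `sheet_name=None` then reads them all back as a dictionary. This sketch assumes **openpyxl** is the installed Excel engine and skips the round trip when it is missing; the second sheet and its data are hypothetical, and an in-memory buffer stands in for a real file.

```python
from importlib.util import find_spec
from io import BytesIO

import pandas as pd

HAS_OPENPYXL = find_spec("openpyxl") is not None
sheets = {}

if HAS_OPENPYXL:
    first = pd.DataFrame({"a": [0, 4], "b": [1, 5]})
    second = pd.DataFrame({"x": [10, 20]})  # hypothetical second sheet

    buffer = BytesIO()  # in-memory stand-in for a file such as "newexcel.xlsx"
    with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
        first.to_excel(writer, sheet_name="First_Sheet", index=False)
        second.to_excel(writer, sheet_name="Second_Sheet", index=False)

    buffer.seek(0)
    # sheet_name=None returns {sheet name: DataFrame} for every sheet
    sheets = pd.read_excel(buffer, sheet_name=None, engine="openpyxl")
    print(list(sheets))
else:
    print("openpyxl is not installed; skipping the Excel round trip")
```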

## 3. HTML Tables

* Websites display tabular information through the use of HTML **table** tags.
* Pandas can automatically convert these HTML tables into DataFrames.
* Not every table on a website is built with HTML table tags.
* Some websites may block your computer from scraping the HTML of the site through Pandas. It may be more efficient to **use an API**.
* We may need to install **lxml**, **html5lib**, and **BeautifulSoup4** to grab data from a website.
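`read_html` can also parse an HTML string, which makes it possible to try the function without a network connection. The sketch below assumes the **lxml** parser and skips the parsing step when it is not installed; the tiny table is hand-written, with two rows borrowed from the density table shown later in this notebook.

```python
from importlib.util import find_spec
from io import StringIO

import pandas as pd

html = """
<table>
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td>Singapore</td><td>5921231</td></tr>
  <tr><td>Lebanon</td><td>5296814</td></tr>
</table>
"""

HAS_LXML = find_spec("lxml") is not None
tables = []

if HAS_LXML:
    # read_html always returns a list, with one DataFrame per <table> tag found
    tables = pd.read_html(StringIO(html), flavor="lxml")
    print(len(tables))
    print(tables[0])
else:
    print("lxml is not installed; skipping read_html")
```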

In [22]:
# conda install lxml
!pip install lxml



### HTML Input: read_html()

##### Grab all the tables from a Wikipedia article (World Population), then clean and organize the information to get a DataFrame.

In [23]:
tables = pd.read_html('https://en.wikipedia.org/wiki/World_population')

In [24]:
# Or use url
url = "https://en.wikipedia.org/wiki/World_population"
tables= pd.read_html(url)

In [25]:
# To see all tables... There are lots of tables on this page
# tables 

In [26]:
# See its length...
len(tables)

#At this time, there are 30 "table tags" on the page.

30

In [27]:
# Grab the first table
tables[0]

Unnamed: 0,Population,1,2,3,4,5,6,7,8,9,10
0,Year,1804,1927,1960,1974,1987,1999,2011,2022,2037,2057
1,Years elapsed,"200,000+",123,33,14,13,12,12,11,15,20


In [28]:
# Population by region table
tables[3]

Unnamed: 0,Region,Density (inhabitants/km2),Population (millions),Most populous country,Most populous city (metropolitan area)
0,Asia,104.1,4641,"1,439,090,595 – India","13,515,000 – Tokyo Metropolis (37,400,000 – Gr..."
1,Africa,44.4,1340,"0,211,401,000 – Nigeria","09,500,000 – Cairo (20,076,000 – Greater Cairo)"
2,Europe,73.4,747,"0,146,171,000 – Russia, approx. 110 million in...","13,200,000 – Moscow (20,004,000 – Moscow metro..."
3,Latin America,24.1,653,"0,214,103,000 – Brazil","12,252,000 – São Paulo City (21,650,000 – São ..."
4,Northern America[note 1],14.9,368,"0,332,909,000 – United States","08,804,000 – New York City (23,582,649 – New Y..."
5,Oceania,5,42,"0,025,917,000 – Australia","05,367,000 – Sydney"
6,Antarctica,~0,0.004[88],N/A[note 2],"00,001,258 – McMurdo Station"


In [29]:
# See the columns of this table
tables[3].columns

Index(['Region', 'Density (inhabitants/km2)', 'Population (millions)',
       'Most populous country', 'Most populous city (metropolitan area)'],
      dtype='object')

In [30]:
# World population, table 6
tables[5]

Unnamed: 0_level_0,#,Most populous countries,2000,2015,2030[A],Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.
Unnamed: 0_level_1,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Unnamed: 0_level_2,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Unnamed: 0_level_3,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3
Unnamed: 0_level_4,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_4,Unnamed: 2_level_4,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4
Unnamed: 0_level_5,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_5,Unnamed: 2_level_5,Unnamed: 3_level_5,Unnamed: 4_level_5,Unnamed: 5_level_5
Unnamed: 0_level_6,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_6,Unnamed: 2_level_6,Unnamed: 3_level_6,Unnamed: 4_level_6,Unnamed: 5_level_6
Unnamed: 0_level_7,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_7,Unnamed: 2_level_7,Unnamed: 3_level_7,Unnamed: 4_level_7,Unnamed: 5_level_7
Unnamed: 0_level_8,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_8,Unnamed: 2_level_8,Unnamed: 3_level_8,Unnamed: 4_level_8,Unnamed: 5_level_8
Unnamed: 0_level_9,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_9,Unnamed: 2_level_9,Unnamed: 3_level_9,Unnamed: 4_level_9,Unnamed: 5_level_9
Unnamed: 0_level_10,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_10,Unnamed: 2_level_10,Unnamed: 3_level_10,Unnamed: 4_level_10,Unnamed: 5_level_10
Unnamed: 0_level_11,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_11,Unnamed: 2_level_11,Unnamed: 3_level_11,Unnamed: 4_level_11,Unnamed: 5_level_11
0,,Graphs are unavailable due to technical issues...,,,,
1,1,China[B],1270,1376,1416,
2,2,India,1053,1311,1528,
3,3,United States,283,322,356,
4,4,Indonesia,212,258,295,
5,5,Pakistan,136,208,245,
6,6,Brazil,176,206,228,
7,7,Nigeria,123,182,263,
8,8,Bangladesh,131,161,186,
9,9,Russia,146,146,149,


In [31]:
tables[5].columns

MultiIndex([(                                                                                                      '#', ...),
            (                                                                                'Most populous countries', ...),
            (                                                                                                   '2000', ...),
            (                                                                                                   '2015', ...),
            (                                                                                                '2030[A]', ...),
            ('Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.', ...)],
           )
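When a scraped table comes back with multi-level column labels like these, one option is to flatten them before cleaning. The sketch below uses a small hypothetical two-level header instead of the real Wikipedia table, and keeps only the top level with `get_level_values(0)`.

```python
import pandas as pd

# Hypothetical two-level header mimicking the shape of the scraped table
columns = pd.MultiIndex.from_tuples(
    [("#", "note"), ("Most populous countries", "note"), ("2000", "note")]
)
df = pd.DataFrame([[1, "China", 1270], [2, "India", 1053]], columns=columns)

# Keep only the top level so that each column has a single plain label
df.columns = df.columns.get_level_values(0)
print(df.columns.tolist())

# With flat labels, columns can be dropped by name directly
df = df.drop(columns="#")
print(df.columns.tolist())
```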

In [32]:
world_topten = tables[5]

In [33]:
# Let's drop the last row (12), which includes some notes
world_topten = world_topten.drop(12, axis=0)

In [34]:
# Let's drop the first column (#)
world_topten = world_topten.drop("#", axis=1)


In [35]:
# Also drop the last column
world_topten = world_topten.drop("Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.", axis=1)


In [36]:
world_topten

Unnamed: 0_level_0,Most populous countries,2000,2015,2030[A]
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Unnamed: 0_level_2,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Unnamed: 0_level_3,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3
Unnamed: 0_level_4,Unnamed: 1_level_4,Unnamed: 2_level_4,Unnamed: 3_level_4,Unnamed: 4_level_4
Unnamed: 0_level_5,Unnamed: 1_level_5,Unnamed: 2_level_5,Unnamed: 3_level_5,Unnamed: 4_level_5
Unnamed: 0_level_6,Unnamed: 1_level_6,Unnamed: 2_level_6,Unnamed: 3_level_6,Unnamed: 4_level_6
Unnamed: 0_level_7,Unnamed: 1_level_7,Unnamed: 2_level_7,Unnamed: 3_level_7,Unnamed: 4_level_7
Unnamed: 0_level_8,Unnamed: 1_level_8,Unnamed: 2_level_8,Unnamed: 3_level_8,Unnamed: 4_level_8
Unnamed: 0_level_9,Unnamed: 1_level_9,Unnamed: 2_level_9,Unnamed: 3_level_9,Unnamed: 4_level_9
Unnamed: 0_level_10,Unnamed: 1_level_10,Unnamed: 2_level_10,Unnamed: 3_level_10,Unnamed: 4_level_10
Unnamed: 0_level_11,Unnamed: 1_level_11,Unnamed: 2_level_11,Unnamed: 3_level_11,Unnamed: 4_level_11
0,Graphs are unavailable due to technical issues...,,,
1,China[B],1270.0,1376.0,1416.0
2,India,1053.0,1311.0,1528.0
3,United States,283.0,322.0,356.0
4,Indonesia,212.0,258.0,295.0
5,Pakistan,136.0,208.0,245.0
6,Brazil,176.0,206.0,228.0
7,Nigeria,123.0,182.0,263.0
8,Bangladesh,131.0,161.0,186.0
9,Russia,146.0,146.0,149.0


In [37]:
# Let's rename the columns
world_topten.columns = ["Country","2000", "2015", "2030 Est."]

In [38]:
world_topten

Unnamed: 0,Country,2000,2015,2030 Est.
0,Graphs are unavailable due to technical issues...,,,
1,China[B],1270.0,1376.0,1416.0
2,India,1053.0,1311.0,1528.0
3,United States,283.0,322.0,356.0
4,Indonesia,212.0,258.0,295.0
5,Pakistan,136.0,208.0,245.0
6,Brazil,176.0,206.0,228.0
7,Nigeria,123.0,182.0,263.0
8,Bangladesh,131.0,161.0,186.0
9,Russia,146.0,146.0,149.0


In [39]:
# We can drop the first row, index 0

world_topten = world_topten.drop(0, axis=0)

In [40]:
world_topten

Unnamed: 0,Country,2000,2015,2030 Est.
1,China[B],1270,1376,1416
2,India,1053,1311,1528
3,United States,283,322,356
4,Indonesia,212,258,295
5,Pakistan,136,208,245
6,Brazil,176,206,228
7,Nigeria,123,182,263
8,Bangladesh,131,161,186
9,Russia,146,146,149
10,Mexico,103,127,148
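The scraped text still carries Wikipedia footnote markers such as `[B]` or `[note 1]`. A possible extra cleaning step, not part of the original notebook, is to strip them with a regular expression; the sample values are taken from the outputs above.

```python
import pandas as pd

countries = pd.Series(["China[B]", "India", "Northern America[note 1]"])

# Remove every "[...]" group, then trim any leftover whitespace
cleaned = countries.str.replace(r"\[[^\]]*\]", "", regex=True).str.strip()
print(cleaned.tolist())  # ['China', 'India', 'Northern America']
```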


##### Let's try another table and assign a column as an index 

In [41]:
tables[7]

Unnamed: 0,Rank,Country,Population,Area (km2),Density (pop/km2)
0,1,Singapore,5921231,719,8235
1,2,Bangladesh,165650475,148460,1116
2,3,Palestine[note 3][102],5223000,6025,867
3,4,Taiwan[note 4],23580712,35980,655
4,5,South Korea,51844834,99720,520
5,6,Lebanon,5296814,10400,509
6,7,Rwanda,13173730,26338,500
7,8,Burundi,12696478,27830,456
8,9,India,1389637446,3287263,423
9,10,Netherlands,17400824,41543,419


In [42]:
# We can use the Rank column as the index (set_index returns a new DataFrame; tables[7] itself is unchanged)
tables[7].set_index("Rank")

Unnamed: 0_level_0,Country,Population,Area (km2),Density (pop/km2)
Rank,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,Singapore,5921231,719,8235
2,Bangladesh,165650475,148460,1116
3,Palestine[note 3][102],5223000,6025,867
4,Taiwan[note 4],23580712,35980,655
5,South Korea,51844834,99720,520
6,Lebanon,5296814,10400,509
7,Rwanda,13173730,26338,500
8,Burundi,12696478,27830,456
9,India,1389637446,3287263,423
10,Netherlands,17400824,41543,419


### HTML Output: to_html()

In [43]:
# Let's export the "world_topten" table as an HTML file
world_topten.to_html("population_table.html", index=False)

In [44]:
# We can find the file in the working directory and check that it was saved successfully.
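The check can also be done in code. The sketch below writes a small illustrative DataFrame to a temporary directory (a stand-in for the working directory) and reads the file back as text; calling `to_html()` without a path returns the markup as a string instead.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Illustrative data with the same column names as world_topten
df = pd.DataFrame({"Country": ["China", "India"], "2000": [1270, 1053]})

html = df.to_html(index=False)  # no path given, so the HTML comes back as a str

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "population_table.html"
    df.to_html(path, index=False)
    saved = path.read_text(encoding="utf-8")

print(saved[:40])
```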

## 4. SQL Connections

* Pandas can read and write to various SQL engines through the use of a driver and the **sqlalchemy** python library.
* Figure out what SQL Engine you are connecting to:
  [MySQL](https://www.google.com/search?q=mysql+python),
  [PostgreSQL](https://www.google.com/search?q=postgresql+python),
  [MS SQL Server](https://www.google.com/search?q=MSSQLserver+python),
  [Oracle](https://www.google.com/search?q=oracle+python),
  [MongoDB](https://www.google.com/search?q=mongodb+python), etc.
* Install the appropriate Python driver library, such as **psycopg2** for PostgreSQL, **pymysql** for MySQL and **pyodbc** for MS SQL Server. For an introduction, see https://realpython.com/python-sql-libraries/
* In this notebook, **SQLite** is used because it is a C library that provides a lightweight disk-based database that doesn’t require a separate server process and allows accessing the database using a nonstandard variant of the SQL query language. See https://docs.python.org/3/library/sqlite3.html
* We will need to install sqlalchemy. To understand how it works, see https://docs.sqlalchemy.org/en/13/core/connections.html

In [45]:
# Install sqlalchemy

!pip install sqlalchemy



##### Create a temporary SQLite database in memory

In [46]:
from sqlalchemy import create_engine

In [47]:
temp_db = create_engine("sqlite:///:memory:")

In [48]:
# Let's create some random integer data with NumPy
df = pd.DataFrame(data=np.random.randint(0,100,(4,4)), columns=["a", "b", "c", "d"])  # size(4,4)
df

Unnamed: 0,a,b,c,d
0,61,3,74,92
1,17,26,6,67
2,65,3,53,90
3,63,68,30,36


### Write to database: to_sql

In [49]:
# name and save the table to the database
# our database is "temp_db" that is a temporary database created above
df.to_sql(name="new_table", con=temp_db)   # check with shift+tab 

4

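If we run the same code again, pandas raises a "Table 'new_table' already exists" ValueError, because **if_exists** defaults to "fail". Pass **if_exists="replace"** or **if_exists="append"** to replace the table or add rows to it.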

![image.png](attachment:b4d1bb1d-f336-493a-946c-fc1b07fc2018.png)
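The three `if_exists` modes can be compared in a few lines. This sketch uses a standard-library `sqlite3` connection instead of the SQLAlchemy engine created above, so it runs without extra installs; the data is illustrative.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")  # stdlib connection, swapped in for temp_db
df = pd.DataFrame({"a": [61, 17], "b": [3, 26]})

df.to_sql("new_table", conn, index=False)  # first write creates the table

try:
    df.to_sql("new_table", conn, index=False)  # default: if_exists="fail"
except ValueError as err:
    print(err)

count_sql = "SELECT COUNT(*) AS n FROM new_table"

df.to_sql("new_table", conn, index=False, if_exists="append")   # add rows
appended = int(pd.read_sql_query(count_sql, conn)["n"].iloc[0])

df.to_sql("new_table", conn, index=False, if_exists="replace")  # drop and recreate
replaced = int(pd.read_sql_query(count_sql, conn)["n"].iloc[0])

print(appended, replaced)
conn.close()
```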

In [50]:
# When we read the table back later, it will include an additional "index" column. See the "read_sql" example below.
# To leave the index out, pass "index=False"

# df.to_sql(name="new_table", con=temp_db, index=False)  

### Import data from a database: read_sql()

In [51]:
# call the table and the database
new_df = pd.read_sql(sql="new_table", con=temp_db)
new_df

Unnamed: 0,index,a,b,c,d
0,0,61,3,74,92
1,1,17,26,6,67
2,2,65,3,53,90
3,3,63,68,30,36


In [52]:
# We could use SQL query. For example use SELECT to see "a" and "c" columns
result = pd.read_sql_query(sql="SELECT a,c FROM new_table", con=temp_db)
result

Unnamed: 0,a,c
0,61,74
1,17,6
2,65,53
3,63,30


##### Let's save one of the tables that we extracted from Wikipedia: "World Population". For example, we can save tables[7] to our "temp_db".

In [53]:
pop = tables[7]

In [54]:
# save it as "populations" table
pop.to_sql("populations", con=temp_db, index=False) # without additional index column

10

##### Read it from the SQL database

In [57]:
pd.read_sql("populations", con=temp_db)

Unnamed: 0,Rank,Country,Population,Area (km2),Density (pop/km2)
0,1,Singapore,5921231,719,8235
1,2,Bangladesh,165650475,148460,1116
2,3,Palestine[note 3][102],5223000,6025,867
3,4,Taiwan[note 4],23580712,35980,655
4,5,South Korea,51844834,99720,520
5,6,Lebanon,5296814,10400,509
6,7,Rwanda,13173730,26338,500
7,8,Burundi,12696478,27830,456
8,9,India,1389637446,3287263,423
9,10,Netherlands,17400824,41543,419


### read_sql_query

##### Use SELECT and read only "Country" and "Population" columns

In [58]:
pd.read_sql_query(sql="SELECT Country, Population FROM populations", con=temp_db)

Unnamed: 0,Country,Population
0,Singapore,5921231
1,Bangladesh,165650475
2,Palestine[note 3][102],5223000
3,Taiwan[note 4],23580712
4,South Korea,51844834
5,Lebanon,5296814
6,Rwanda,13173730
7,Burundi,12696478
8,India,1389637446
9,Netherlands,17400824
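Queries can also filter rows before they reach pandas. The sketch below is an assumption-laden variant of the cells above: it uses a standard-library `sqlite3` connection instead of `temp_db`, loads three rows copied from the table above, and passes the threshold through `params` with SQLite's `?` placeholder rather than formatting it into the SQL string.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")  # stdlib connection, swapped in for temp_db

pop = pd.DataFrame(
    {
        "Country": ["Singapore", "Bangladesh", "Lebanon"],
        "Population": [5921231, 165650475, 5296814],
    }
)
pop.to_sql("populations", conn, index=False)

query = (
    "SELECT Country, Population FROM populations "
    "WHERE Population > ? ORDER BY Population DESC"
)
# The value in params is bound to the "?" placeholder by the database driver
large = pd.read_sql_query(query, conn, params=(5_500_000,))
print(large)
conn.close()
```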
