In [1]:
import numpy as np
import pandas as pd

In [2]:
myframe = pd.read_csv('student1.csv')

In [3]:
myframe

Unnamed: 0,Rollno,Name,Percent
0,101,'Ram',86.5
1,102,'Shyam',77.6
2,103,'Anil',94.3


A CSV file holds tabular data as plain text: each line is one row, and the values within a row are separated by commas. Because a CSV file is just a text file, you can also read it with the read_table() function, as long as you specify the delimiter.

In [4]:
myframe2 = pd.read_table('student1.csv', sep=',')
myframe2

Unnamed: 0,Rollno,Name,Percent
0,101,'Ram',86.5
1,102,'Shyam',77.6
2,103,'Anil',94.3


In the example above, the first row of the CSV file holds the headers that identify the columns.
This is not always the case; often the tabulated data begin directly on the first line.

In that case, pandas still treats the first line of the data as the column names, so the first record is lost from the data.

To overcome this problem, pass the header argument to the read_csv() function.

In [5]:
myframe = pd.read_csv('student2.csv')

In [6]:
myframe

Unnamed: 0,101,'Ram',86.5
0,102,'Shyam',77.6
1,103,'Anil',94.3


Look at the header row: the first record of the file has been treated as the column names. To overcome this, pass header=None so that pandas assigns default column names:

In [7]:
myframe = pd.read_csv('student2.csv', header=None)

In [8]:
myframe

Unnamed: 0,0,1,2
0,101,'Ram',86.5
1,102,'Shyam',77.6
2,103,'Anil',94.3


In [9]:
myframe = pd.read_csv('student2.csv', names=['Rollno', 'Names', 'Percent'])

In [10]:
myframe

Unnamed: 0,Rollno,Names,Percent
0,101,'Ram',86.5
1,102,'Shyam',77.6
2,103,'Anil',94.3
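
The three behaviours can be compared side by side without touching the disk. The sketch below assumes a headerless file with the same three records as student2.csv and feeds it to read_csv() through io.StringIO; the in-memory text is a stand-in, not the original file.

```python
import io
import pandas as pd

# Stand-in for a headerless file such as student2.csv (assumed contents)
raw = "101,Ram,86.5\n102,Shyam,77.6\n103,Anil,94.3\n"

# Default: the first record is consumed as the header, so only 2 rows remain
default = pd.read_csv(io.StringIO(raw))

# header=None: all 3 records are kept and the columns are numbered 0, 1, 2
numbered = pd.read_csv(io.StringIO(raw), header=None)

# names=[...]: all 3 records are kept and the columns get the given labels
named = pd.read_csv(io.StringIO(raw), names=['Rollno', 'Name', 'Percent'])

print(default.shape, numbered.shape, named.shape)
print(list(named.columns))
```

Passing names= on its own is enough for a headerless file; pandas then does not look for a header line.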


In [11]:
dframe = pd.read_csv('mycsv3.csv')

In [12]:
dframe

Unnamed: 0,color,status,item1,item2,item3
0,black,up,3,4,6
1,black,down,2,6,7
2,white,up,5,5,5
3,white,down,3,3,2
4,white,left,1,2,1
5,red,up,2,2,2
6,red,down,1,1,4


To use one or more columns as the index (or as a hierarchical index), pass the index_col option:

In [13]:
dframe = pd.read_csv('mycsv3.csv', index_col='color')

In [14]:
dframe

color,status,item1,item2,item3
black,up,3,4,6
black,down,2,6,7
white,up,5,5,5
white,down,3,3,2
white,left,1,2,1
red,up,2,2,2
red,down,1,1,4


Or, for hierarchical indexing, pass a list of columns:

In [15]:
dframe = pd.read_csv('mycsv3.csv', index_col=['color', 'status'])

In [16]:
dframe

color,status,item1,item2,item3
black,up,3,4,6
black,down,2,6,7
white,up,5,5,5
white,down,3,3,2
white,left,1,2,1
red,up,2,2,2
red,down,1,1,4


In [17]:
dframe = pd.read_csv('mycsv3.csv', index_col=['status', 'color'])

In [18]:
dframe

status,color,item1,item2,item3
up,black,3,4,6
down,black,2,6,7
up,white,5,5,5
down,white,3,3,2
left,white,1,2,1
up,red,2,2,2
down,red,1,1,4
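
Once the columns form a hierarchical index, rows can be selected by one level or by both. The following is a minimal sketch, assuming the same records as mycsv3.csv held in an in-memory string rather than the original file.

```python
import io
import pandas as pd

# Assumed contents, mirroring the mycsv3.csv table shown above
raw = """color,status,item1,item2,item3
black,up,3,4,6
black,down,2,6,7
white,up,5,5,5
white,down,3,3,2
white,left,1,2,1
red,up,2,2,2
red,down,1,1,4
"""

dframe = pd.read_csv(io.StringIO(raw), index_col=['color', 'status'])

# Outer level only: every row whose color is 'white'
white_rows = dframe.loc['white']

# Both levels: the single row with color='red' and status='down'
red_down = dframe.loc[('red', 'down'), :]

print(white_rows)
print(red_down['item3'])
```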


## Using RegExp for Parsing TXT Files

In other cases, the files to parse do not use a well-defined separator such as
a comma or a semicolon. In these cases, regular expressions come to our aid: you can specify a
regexp in the read_table() function through the sep option.


Table: Metacharacters

|metacharacter| meaning|
|---------|--------|
|.| single character, except newline|
|\d| digit|
|\D| non-digit character|
|\s| whitespace character|
|\S| non-whitespace character|
|\n| new line character|
|\t| tab character|
|\uxxxx| unicode character specified by the hexadecimal number xxxx|

In [19]:
pd.read_table('text.txt')

Unnamed: 0,white red blue green
0,1 5 2 3
1,2 7 8 5
2,3 3 6 7
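
Without a sep argument, read_table() splits on tab characters only, so the space-separated file above is read as a single column. A minimal sketch of the regexp approach follows; it assumes a whitespace-separated text with the same values as text.txt, held in memory, with deliberately irregular spacing.

```python
import io
import pandas as pd

# Assumed contents of a whitespace-separated file such as text.txt;
# the runs of spaces and tabs are deliberately irregular
raw = "white red blue green\n1 5 2 3\n2   7 8\t5\n3 3   6 7\n"

# \s+ matches one or more whitespace characters (spaces or tabs)
frame = pd.read_table(io.StringIO(raw), sep=r'\s+')

print(frame)
```

The raw string prefix (r'...') keeps Python from interpreting the backslash before the pattern reaches pandas.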


# Reading TXT Files in Parts or Partially


If you want to read only a portion of a file, you can explicitly specify which
lines to parse. With the **skiprows** and **nrows** options, you can choose the lines to skip
(**skiprows = n** skips the first n lines, while a list skips the listed line numbers) and the number of rows to read **(nrows = i)**.


In [27]:
df = pd.read_csv('mycsv5.csv')

In [28]:
df

Unnamed: 0,white,red,blue,green,animal
0,1,5,2,3,cat
1,2,7,8,5,dog
2,3,3,6,7,horse
3,2,2,8,3,duck
4,4,4,2,1,mouse


In [29]:
df = pd.read_csv('mycsv5.csv', skiprows=[2], nrows=3, header=None)

In [30]:
df

Unnamed: 0,0,1,2,3,4
0,white,red,blue,green,animal
1,1,5,2,3,cat
2,3,3,6,7,horse


- skiprows=[2]: Skips the line(s) whose 0-based numbers appear in the list. Here, [2] skips the third line of the file (line 0 is the header line, so line 2 is the 'dog' record).
- nrows=3: Limits the number of rows read from the file, not counting the skipped lines. Because header=None is set, the header line counts as the first of these 3 rows.
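
For comparison, the sketch below runs the same options while letting pandas consume the header line as column names, so nrows=3 then yields three data records. It assumes the same contents as mycsv5.csv, held in an in-memory string.

```python
import io
import pandas as pd

# Assumed contents, mirroring the mycsv5.csv table shown above
raw = """white,red,blue,green,animal
1,5,2,3,cat
2,7,8,5,dog
3,3,6,7,horse
2,2,8,3,duck
4,4,2,1,mouse
"""

# Line 0 becomes the header, line 2 (the 'dog' record) is skipped,
# and the next 3 data rows are read
subset = pd.read_csv(io.StringIO(raw), skiprows=[2], nrows=3)

print(subset)
```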

The **chunksize=** argument

In [34]:
pd.read_csv('mycsv5.csv')

Unnamed: 0,white,red,blue,green,animal
0,1,5,2,3,cat
1,2,7,8,5,dog
2,3,3,6,7,horse
3,2,2,8,3,duck
4,4,4,2,1,mouse


In [36]:
out = pd.Series(dtype='int64')
i = 0
pieces = pd.read_csv('mycsv5.csv', chunksize=3)

In [35]:
for piece in pieces:
    # Series.set_value() was removed in pandas 1.0; assign by label with .loc instead
    out.loc[i] = piece['white'].sum()
    i += 1
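
One way to collect a per-chunk result on current pandas versions is to build a plain list and wrap it in a Series at the end. This is a minimal sketch, assuming the same five records as mycsv5.csv held in an in-memory string.

```python
import io
import pandas as pd

# Assumed contents, mirroring the mycsv5.csv table shown above
raw = """white,red,blue,green,animal
1,5,2,3,cat
2,7,8,5,dog
3,3,6,7,horse
2,2,8,3,duck
4,4,2,1,mouse
"""

# chunksize=3 returns an iterator that yields DataFrames of up to 3 rows
pieces = pd.read_csv(io.StringIO(raw), chunksize=3)

# One partial sum of the 'white' column per chunk
partial_sums = [piece['white'].sum() for piece in pieces]
out = pd.Series(partial_sums)

print(out)
```

With five records and chunksize=3, the iterator yields two chunks (three rows, then two), so out holds two partial sums.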

# Writing Data to CSV

In [37]:
df = {'Name': ['Prakhar', 'Sachin', 'Raj'], 'Rollno': [22, 33, 44], 'Percent': [90.0, 67.8, 89.4]}

In [38]:
myframe = pd.DataFrame(df)

In [40]:
myframe

Unnamed: 0,Name,Rollno,Percent
0,Prakhar,22,90.0
1,Sachin,33,67.8
2,Raj,44,89.4


In [41]:
myframe.to_csv('mycsv6.csv')

In [42]:
pd.read_csv('mycsv6.csv')

Unnamed: 0.1,Unnamed: 0,Name,Rollno,Percent
0,0,Prakhar,22,90.0
1,1,Sachin,33,67.8
2,2,Raj,44,89.4


As the previous example shows, when you write a data frame to a file, both the index
and the column names are written by default. You can change this behavior by setting
the **index** and **header** options to False.

In [47]:
myframe.to_csv('mycsv7.csv', header=False, index=False)

In [48]:
pd.read_csv('mycsv7.csv')

Unnamed: 0,Prakhar,22,90.0
0,Sachin,33,67.8
1,Raj,44,89.4
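
A common middle ground is to drop only the index while keeping the header, so that the file reads back without the extra Unnamed: 0 column. The sketch below assumes the same three records as above and writes to an in-memory string instead of a file.

```python
import io
import pandas as pd

myframe = pd.DataFrame({'Name': ['Prakhar', 'Sachin', 'Raj'],
                        'Rollno': [22, 33, 44],
                        'Percent': [90.0, 67.8, 89.4]})

# to_csv() returns the CSV text as a string when no path is given
text = myframe.to_csv(index=False)

# Reading it back yields the original columns, with no 'Unnamed: 0' column
restored = pd.read_csv(io.StringIO(text))

print(text)
print(list(restored.columns))
```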


One thing to keep in mind when writing files is that NaN values in a data
structure are written as empty fields in the file.

In [49]:
data = {'Name': ['Prakhar', 'Sachin', 'Raj', np.nan], 'Rollno': [22, np.nan, 44, 68], 'Percent': [90.0, 67.8, np.nan, 89.4]}

In [50]:
myframe = pd.DataFrame(data)
myframe

Unnamed: 0,Name,Rollno,Percent
0,Prakhar,22.0,90.0
1,Sachin,,67.8
2,Raj,44.0,
3,,68.0,89.4


In [51]:
myframe.to_csv('mycsv8.csv')

In [52]:
pd.read_csv('mycsv8.csv')

Unnamed: 0.1,Unnamed: 0,Name,Rollno,Percent
0,0,Prakhar,22.0,90.0
1,1,Sachin,,67.8
2,2,Raj,44.0,
3,3,,68.0,89.4


When this file is opened in a text editor such as Notepad, the NaN values appear as empty fields.

To handle this:

Replace the empty field with a value of your choice using the na_rep option of the to_csv()
function. Common values are NULL, 0, or NaN itself.

In [53]:
myframe.to_csv('mycsv9.csv', na_rep=0)

In [54]:
pd.read_csv('mycsv9.csv')

Unnamed: 0.1,Unnamed: 0,Name,Rollno,Percent
0,0,Prakhar,22.0,90.0
1,1,Sachin,0.0,67.8
2,2,Raj,44.0,0.0
3,3,0,68.0,89.4
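
Note that after writing with 0 as the replacement, the missing entries can no longer be told apart from real zeros when the file is read back. The sketch below uses a small two-row frame (an assumption made for brevity) to compare the default output with na_rep='NULL', a marker that read_csv() recognizes as missing by default.

```python
import io
import numpy as np
import pandas as pd

# Small illustrative frame with one fully missing row
frame = pd.DataFrame({'Name': ['Prakhar', np.nan],
                      'Percent': [90.0, np.nan]})

# Default: NaN values become empty fields
default_text = frame.to_csv(index=False)

# na_rep='NULL': NaN values are written as the text NULL
null_text = frame.to_csv(index=False, na_rep='NULL')

# 'NULL' is among the strings read_csv() treats as missing by default
restored = pd.read_csv(io.StringIO(null_text))

print(default_text)
print(null_text)
print(restored.isna().sum().sum())
```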


# Reading & Writing HTML Files

Many websites have now adopted the HTML5 format. To avoid issues with missing modules and
error messages, it is strongly recommended to install the html5lib module.

In [55]:
pip install html5lib

Collecting html5lib
  Downloading html5lib-1.1-py2.py3-none-any.whl (112 kB)
Collecting webencodings (from html5lib)
  Downloading webencodings-0.5.1-py2.py3-none-any.whl (11 kB)
Installing collected packages: webencodings, html5lib
Successfully installed html5lib-1.1 webencodings-0.5.1
Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 23.3.1 -> 23.3.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [57]:
pip install --upgrade pip


Collecting pip
  Downloading pip-23.3.2-py3-none-any.whl.metadata (3.5 kB)
Downloading pip-23.3.2-py3-none-any.whl (2.1 MB)

**Writing data in HTML**

In [58]:
frame = pd.DataFrame(np.arange(4).reshape(2,2))

In [59]:
frame

Unnamed: 0,0,1
0,0,1
1,2,3


In [62]:
print(frame.to_html())

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>0</th>
      <th>1</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>0</td>
      <td>1</td>
    </tr>
    <tr>
      <th>1</th>
      <td>2</td>
      <td>3</td>
    </tr>
  </tbody>
</table>


In [63]:
frame = pd.DataFrame( np.random.random((4,4)), index = ['white','black','red','blue'], columns = ['up','down','right','left'])


In [64]:
frame

Unnamed: 0,up,down,right,left
white,0.995326,0.777191,0.192618,0.695189
black,0.581631,0.605783,0.496921,0.618061
red,0.236164,0.313739,0.576977,0.227948
blue,0.649302,0.043073,0.691099,0.702898


Writing this dataframe to an HTML page

Now focus on writing an HTML page by generating a string. This is a simple
example, but it is useful for understanding and testing the functionality of pandas directly in the web browser.
First, create a string that contains the code of the HTML page.


In [65]:
s = ['<HTML>']
s.append('<HEAD><TITLE>My DataFrame</TITLE></HEAD>')
s.append('<BODY>')
s.append(frame.to_html())
s.append('</BODY></HTML>')
html = ''.join(s)


Now that the whole HTML page is contained in the variable html, you can write it directly
to a file called **myfile.html**:


In [66]:
# The with block closes the file automatically
with open('myfile.html', 'w') as html_file:
    html_file.write(html)

Your working directory now contains a new HTML file, myfile.html.
Open this file in a web browser (for example, through a live server) and observe the result.
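
The table markup itself can be tuned before it is embedded in the page. The sketch below is one possible variation, not part of the original notebook: it uses the float_format option of to_html() to round the values to two decimals, with a small hand-written frame standing in for the random one.

```python
import pandas as pd

# Small stand-in frame (values chosen by hand for illustration)
frame = pd.DataFrame([[0.5, 0.25], [0.125, 1.0]],
                     index=['white', 'black'],
                     columns=['up', 'down'])

# float_format receives each float and returns the text to display
table = frame.to_html(float_format=lambda x: f'{x:.2f}')

# Assemble the page in the same way as above
html = ''.join(['<HTML>',
                '<HEAD><TITLE>My DataFrame</TITLE></HEAD>',
                '<BODY>', table, '</BODY></HTML>'])

print(html)
```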

# Reading Data from an HTML File

The read_html() function returns a list of DataFrames, even if there is only one table.
The source to parse can be of different types, such as a local HTML file or a URL.

In [67]:
web_frames = pd.read_html('myfile.html')

ImportError: Missing optional dependency 'lxml'.  Use pip or conda to install lxml.

In [68]:
pip install lxml

Collecting lxml
  Downloading lxml-5.0.0-cp312-cp312-win_amd64.whl.metadata (6.9 kB)
Downloading lxml-5.0.0-cp312-cp312-win_amd64.whl (3.9 MB)
Note: you may need to restart the kernel to use updated packages.

In [69]:
web_frames = pd.read_html('myfile.html')

In [70]:
web_frames

[  Unnamed: 0        up      down     right      left
 0      white  0.995326  0.777191  0.192618  0.695189
 1      black  0.581631  0.605783  0.496921  0.618061
 2        red  0.236164  0.313739  0.576977  0.227948
 3       blue  0.649302  0.043073  0.691099  0.702898]

read_html() returns a list of DataFrames, stored here in the variable **web_frames**; the single table in the file is web_frames[0].

In [75]:
ranking = pd.read_html('https://afd.calpoly.edu/web/sample-tables')

The same operation can be run on any web page that has one or more tables.

In [76]:
ranking[0]

Unnamed: 0,Description,Date,Location
0,Academic Senate Meeting,"May 25, 2205",Building 99 Room 1
1,Commencement Meeting,"December 15, 2205",Building 42 Room 10
2,Dean's Council,"February 1, 2206",Building 35 Room 5
3,Committee on Committees,"March 3, 2206",Building 1 Room 201
4,"Lorem ipsum dolor sit amet, consectetuer adipi...","Lorem ipsum dolor sit amet, consectetuer adipi...","Lorem ipsum dolor sit amet, consectetuer adipi..."
5,Lorem ipsum dolor,Lorem ipsum dolor,Lorem ipsum dolor


In [77]:
ranking[1]

Unnamed: 0,Name,Telephone,Email,Office
0,Dr. Sally,555-1234,sally@calpoly.edu,12-34
1,Dr. Steve,555-5678,steve@calpoly.edu,56-78
2,Dr. Kathy,555-9012,kathy@calpoly.edu,90-123


Thus, we can directly import data from a website that contains data in tabular form.