In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

# Session 5 - Python for Data Cleaning using built-in Pandas methods


<img src="img/company-logo.png" width=120 height=120 align="right">

Author: Prof. Manoel Gadi

Contact: manoelgadi@gmail.com

Teaching Web: http://mfalonso.pythonanywhere.com

Linkedin: https://www.linkedin.com/in/manoel-gadi-97821213/

Github: https://github.com/manoelgadi

Last revision: 27/October/2022


In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('datasets/BL-Flickr-Images-Book.csv')
df.head()

Unnamed: 0,Identifier,Edition Statement,Place of Publication,Date of Publication,Publisher,Title,Author,Contributors,Corporate Author,Corporate Contributors,Former owner,Engraver,Issuance type,Flickr URL,Shelfmarks
0,206,,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,"FORBES, Walter.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12641.b.30.
1,216,,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12626.cc.2.
2,218,,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12625.dd.1.
3,472,,London,1851,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.","Appleyard, Ernest Silvanus.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 10369.bbb.15.
4,480,"A new edition, revised, etc.",London,1857,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.","BROOME, John Henry.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 9007.d.28.


# Removing unnecessary columns

Often you will find that not all data categories in a dataset are useful to you. For example, you might have a dataset that contains student information (name, grade, standard, parent names, and address) but want to focus on analyzing student grades.

In this case, the address categories or names of the parents are not important to you. Retaining these unneeded categories will take up unnecessary space and potentially bog down the runtime as well.

Pandas provides a handy way to remove unwanted columns or rows from a DataFrame with the drop() function. Let's look at a simple example where we drop multiple columns from a DataFrame.

We already loaded the CSV file 'BL-Flickr-Images-Book.csv' into a DataFrame at the top of the notebook. The path passed to pd.read_csv is relative, so the file must sit in a folder called datasets inside the current working directory. Before dropping anything, let's summarize the numeric columns:

In [4]:
df[df.describe().columns].describe()

Unnamed: 0,Identifier,Corporate Author,Corporate Contributors,Engraver
count,8287.0,0.0,0.0,0.0
mean,2017344.0,,,
std,1190379.0,,,
min,206.0,,,
25%,915787.5,,,
50%,2043707.0,,,
75%,3047430.0,,,
max,4160339.0,,,


In [5]:
for item in df.describe().columns:
    print(item)
    df[item] = df[item].fillna(0.0)

Identifier
Corporate Author
Corporate Contributors
Engraver


In [6]:
df[df.describe().columns].describe()

Unnamed: 0,Identifier,Corporate Author,Corporate Contributors,Engraver
count,8287.0,8287.0,8287.0,8287.0
mean,2017344.0,0.0,0.0,0.0
std,1190379.0,0.0,0.0,0.0
min,206.0,0.0,0.0,0.0
25%,915787.5,0.0,0.0,0.0
50%,2043707.0,0.0,0.0,0.0
75%,3047430.0,0.0,0.0,0.0
max,4160339.0,0.0,0.0,0.0


In [7]:
non_numeric_columns = list(set(df.columns)-set(df.describe().columns)) # Creating a list of non-numeric columns (text)

In [8]:
for item in non_numeric_columns:
    print(item)
    df[item] = df[item].fillna("")

Title
Edition Statement
Issuance type
Publisher
Place of Publication
Date of Publication
Shelfmarks
Author
Former owner
Contributors
Flickr URL
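
The cells above separate numeric from text columns by comparing `df.columns` with the columns that `describe()` reports. As a hedged alternative, the same split can be sketched with `select_dtypes`. The `toy` DataFrame and its values are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy frame standing in for the book dataset (illustrative values only).
toy = pd.DataFrame({
    "Identifier": [206, 216, 218],
    "Engraver": [float("nan")] * 3,   # an all-missing column is stored as float64
    "Publisher": ["S. Tinsley & Co.", None, "Virtue & Co."],
})

# Split the column labels by dtype instead of diffing against describe().
numeric_cols = toy.select_dtypes(include="number").columns
text_cols = toy.select_dtypes(exclude="number").columns

# Fill each group with a placeholder that matches its type.
toy[numeric_cols] = toy[numeric_cols].fillna(0.0)
toy[text_cols] = toy[text_cols].fillna("")

print(list(numeric_cols))
print(list(text_cols))
```

Unlike the set difference, `select_dtypes` keeps the columns in their original order.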


In [9]:
df

Unnamed: 0,Identifier,Edition Statement,Place of Publication,Date of Publication,Publisher,Title,Author,Contributors,Corporate Author,Corporate Contributors,Former owner,Engraver,Issuance type,Flickr URL,Shelfmarks
0,206,,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,"FORBES, Walter.",0.0,0.0,,0.0,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12641.b.30.
1,216,,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",0.0,0.0,,0.0,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12626.cc.2.
2,218,,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",0.0,0.0,,0.0,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12625.dd.1.
3,472,,London,1851,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.","Appleyard, Ernest Silvanus.",0.0,0.0,,0.0,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 10369.bbb.15.
4,480,"A new edition, revised, etc.",London,1857,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.","BROOME, John Henry.",0.0,0.0,,0.0,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 9007.d.28.
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8282,4158088,,London,1838,,"The Parochial History of Cornwall, founded on,...","GIDDY, afterwards GILBERT, Davies.","BOASE, Henry Samuel.|HALS, William.|LYSONS, Da...",0.0,0.0,,0.0,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS|British Library HMNTS 10...
8283,4158128,,Derby,"1831, 32",M. Mozley & Son,The History and Gazetteer of the County of Der...,"GLOVER, Stephen - of Derby","NOBLE, Thomas.",0.0,0.0,,0.0,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS|British Library HMNTS 10...
8284,4159563,,London,[1806]-22,T. Cadell and W. Davies,Magna Britannia; being a concise topographical...,"LYSONS, Daniel - M.A., F.R.S., and LYSONS (Sam...","GREGSON, Matthew.|LYSONS, Samuel - F.R.S",0.0,0.0,,0.0,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS|British Library HMNTS 19...
8285,4159587,,Newcastle upon Tyne,1834,Mackenzie & Dent,"An historical, topographical and descriptive v...","Mackenzie, E. (Eneas)","ROSS, M. - of Durham",0.0,0.0,,0.0,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS|British Library HMNTS 10...


In [10]:
df['Contributors'].value_counts().to_excel("./datasets/Contributors.xlsx")

When we look at the first five entries using the head() method, we can see that a handful of columns provide ancillary information that would be useful to the library but isn't very descriptive of the books themselves: Edition Statement, Corporate Author, Corporate Contributors, Former owner, Engraver, Contributors, Issuance type and Shelfmarks.

We can drop these columns as follows:

In [11]:
to_drop = ['Edition Statement',
           'Corporate Author',
           'Corporate Contributors',
           'Former owner',
           'Engraver',
           'Contributors',
           'Issuance type',
           'Shelfmarks']

df.drop(to_drop, inplace = True, axis = 1)
df.head()

Unnamed: 0,Identifier,Place of Publication,Date of Publication,Publisher,Title,Author,Flickr URL
0,206,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,http://www.flickr.com/photos/britishlibrary/ta...
1,216,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
2,218,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
3,472,London,1851,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...
4,480,London,1857,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...


Above, we defined a list containing the names of all the columns we want to remove. Next, we called the drop() method on our object, setting the inplace parameter to True and the axis parameter to 1. This tells pandas to apply the change directly to our object, and to look for the labels to drop among the columns of the object.
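
As a minimal sketch of how `drop()` behaves, the toy example below compares the `axis=1` form with the equivalent `columns=` keyword. The `toy` DataFrame and the `unwanted` list are hypothetical stand-ins for `df` and `to_drop`.

```python
import pandas as pd

# Hypothetical miniature of the book dataset.
toy = pd.DataFrame({
    "Title": ["Walter Forbes", "All for Greed"],
    "Engraver": [None, None],
    "Shelfmarks": ["12641.b.30", "12626.cc.2"],
})
unwanted = ["Engraver", "Shelfmarks"]  # plays the role of to_drop

# Without inplace=True, drop() returns a new DataFrame and leaves toy untouched.
by_axis = toy.drop(unwanted, axis=1)
by_keyword = toy.drop(columns=unwanted)

print(list(by_axis.columns))
print(by_axis.equals(by_keyword))
print(list(toy.columns))
```

With `inplace=True` the method modifies the object and returns `None`, so its result should not be assigned back to a variable.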

Alternatively, we can remove the columns by passing them directly to the columns parameter instead of separately specifying the labels to be removed and the axis where Pandas should look for them, for example df.drop(columns=to_drop, inplace=True).

## Set the index of the dataset

In [12]:
df.index = range(0,len(df))

In [13]:
df

Unnamed: 0,Identifier,Place of Publication,Date of Publication,Publisher,Title,Author,Flickr URL
0,206,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,http://www.flickr.com/photos/britishlibrary/ta...
1,216,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
2,218,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
3,472,London,1851,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...
4,480,London,1857,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...
...,...,...,...,...,...,...,...
8282,4158088,London,1838,,"The Parochial History of Cornwall, founded on,...","GIDDY, afterwards GILBERT, Davies.",http://www.flickr.com/photos/britishlibrary/ta...
8283,4158128,Derby,"1831, 32",M. Mozley & Son,The History and Gazetteer of the County of Der...,"GLOVER, Stephen - of Derby",http://www.flickr.com/photos/britishlibrary/ta...
8284,4159563,London,[1806]-22,T. Cadell and W. Davies,Magna Britannia; being a concise topographical...,"LYSONS, Daniel - M.A., F.R.S., and LYSONS (Sam...",http://www.flickr.com/photos/britishlibrary/ta...
8285,4159587,Newcastle upon Tyne,1834,Mackenzie & Dent,"An historical, topographical and descriptive v...","Mackenzie, E. (Eneas)",http://www.flickr.com/photos/britishlibrary/ta...


In [14]:
df.set_index('Identifier', inplace = True)
df.head()

Unnamed: 0_level_0,Place of Publication,Date of Publication,Publisher,Title,Author,Flickr URL
Identifier,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
206,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,http://www.flickr.com/photos/britishlibrary/ta...
216,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
218,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
472,London,1851,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...
480,London,1857,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...


In [15]:
df.index[0:10]

Int64Index([206, 216, 218, 472, 480, 481, 519, 667, 874, 1143], dtype='int64', name='Identifier')
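
Setting `Identifier` as the index makes label-based lookups possible. The sketch below uses a hypothetical three-row `toy` DataFrame to show the difference between `loc` (by label) and `iloc` (by position).

```python
import pandas as pd

# Hypothetical three-row sample with the same key column as the dataset.
toy = pd.DataFrame({
    "Identifier": [206, 216, 218],
    "Title": ["Walter Forbes", "All for Greed", "Love the Avenger"],
})

# A column is a sensible index only if its values are not repeated.
assert toy["Identifier"].is_unique

indexed = toy.set_index("Identifier")

by_label = indexed.loc[216, "Title"]      # the record whose Identifier is 216
by_position = indexed.iloc[1]["Title"]    # the second row, whatever its label

print(by_label)
print(by_position)
```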

In [16]:
df.columns = ['PlaceOfPublication', 'DateOfPublication', 'Publisher', 'Title',
       'Author', 'FlickrURL']

In [17]:
df.rename(columns={'FlickrURL':'URL'},inplace=True)

In [18]:
a = [10,14,15,17]
b = {'nombre':'Manoel', 'edad':43}

In [19]:
df.columns

Index(['PlaceOfPublication', 'DateOfPublication', 'Publisher', 'Title',
       'Author', 'URL'],
      dtype='object')

In [20]:
df['DateOfPublication'].head(25)

Identifier
206            1879 [1878]
216                   1868
218                   1869
472                   1851
480                   1857
481                   1875
519                   1872
667                       
874                   1676
1143                  1679
1280                  1802
1808                  1859
1905                  1888
1929           1839, 38-54
2836                  1897
2854                  1865
2956               1860-63
2957                  1873
3017                  1866
3131                  1899
4598                  1814
4884                  1820
4976                  1800
5382    1847, 48 [1846-48]
5385               [1897?]
Name: DateOfPublication, dtype: object

# Fixing some fields in the data according to their types

In [21]:
df.dtypes

PlaceOfPublication    object
DateOfPublication     object
Publisher             object
Title                 object
Author                object
URL                   object
dtype: object

In [22]:
df.loc[1905:, 'DateOfPublication'].value_counts()

                180
1897            157
1896            150
1893            130
1892            127
               ... 
1892, [1891]      1
[1895-97.]        1
1838-39           1
1854-55           1
1834-43           1
Name: DateOfPublication, Length: 1149, dtype: int64

In [23]:
a = ' Mi String  is      horrible '

In [24]:
a.lower().strip()

'mi string  is      horrible'

In [25]:
df['DateOfPublication'].str.strip()

Identifier
206        1879 [1878]
216               1868
218               1869
472               1851
480               1857
              ...     
4158088           1838
4158128       1831, 32
4159563      [1806]-22
4159587           1834
4160339        1834-43
Name: DateOfPublication, Length: 8287, dtype: object
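
The `.str` accessor applies a string method to every element of a Series and returns a new Series. The cell above therefore displays the stripped values without storing them. A small sketch with made-up values:

```python
import pandas as pd

# Made-up values with stray whitespace and one missing entry.
dates = pd.Series([" 1868 ", "1879 [1878]", None, "  [1897?]"])

# .str.strip() returns a new Series; dates itself is unchanged.
stripped = dates.str.strip()

print(stripped.tolist())
print(repr(dates[0]))
```

To keep the cleaned values, the result has to be assigned back, as in `dates = dates.str.strip()`. Missing entries stay missing rather than raising an error.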

In [26]:
extr = df['DateOfPublication'].str.extract(r'^(\d{4})', expand=False)

The __regular expression__ above is intended to find four digits at the beginning of a string, which is sufficient for our case. The above is a raw string (meaning a backslash is no longer an escape character), which is standard practice with regular expressions.

The \d stands for any digit, and {4} repeats this rule four times. The ^ character matches the beginning of a string, and the parentheses indicate a capturing group, which tells Pandas that we want to extract that part of the regular expression. (We anchor the pattern with ^ so that strings beginning with [ are not matched; such rows yield NaN instead of a year.)
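
To make the pattern concrete, the sketch below applies it to a few date strings copied from the column output above, first with Python's `re` module and then with `str.extract`. The `samples` Series is an illustrative stand-in for the full column.

```python
import re
import pandas as pd

pattern = r"^(\d{4})"

# A few raw values as they appear in DateOfPublication, plus an empty string.
samples = pd.Series(["1879 [1878]", "1868", "1839, 38-54", "[1806]-22", ""])

# Plain-Python view: re.match returns None when the string does not start with four digits.
first = re.match(pattern, "1879 [1878]").group(1)
bracketed = re.match(pattern, "[1806]-22")

# pandas applies the pattern element-wise; rows without a match become NaN.
years = samples.str.extract(pattern, expand=False)

print(first)
print(bracketed)
print(years.tolist())
```

Note that the extracted values are still strings at this point, not numbers.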

Let's see what happens when we run this regular expression on our dataset:

In [27]:
extr

Identifier
206        1879
216        1868
218        1869
472        1851
480        1857
           ... 
4158088    1838
4158128    1831
4159563     NaN
4159587    1834
4160339    1834
Name: DateOfPublication, Length: 8287, dtype: object

# Convert objects that should be numeric to numeric

Technically, this column still has an object type, but we can easily get its numeric version with pd.to_numeric:

In [28]:
df.dtypes

PlaceOfPublication    object
DateOfPublication     object
Publisher             object
Title                 object
Author                object
URL                   object
dtype: object

In [29]:
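# The columns were renamed in cell 16, so this key does not match
# 'DateOfPublication': the numeric years land in a new column and
# the original text column is kept alongside it.
df['Date of Publication'] = pd.to_numeric(extr)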

In [30]:
df.dtypes

PlaceOfPublication      object
DateOfPublication       object
Publisher               object
Title                   object
Author                  object
URL                     object
Date of Publication    float64
dtype: object

In [31]:
df.head(2).T

Identifier,206,216
PlaceOfPublication,London,London; Virtue & Yorston
DateOfPublication,1879 [1878],1868
Publisher,S. Tinsley & Co.,Virtue & Co.
Title,Walter Forbes. [A novel.] By A. A,All for Greed. [A novel. The dedication signed...
Author,A. A.,"A., A. A."
URL,http://www.flickr.com/photos/britishlibrary/ta...,http://www.flickr.com/photos/britishlibrary/ta...
Date of Publication,1879.0,1868.0
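
The conversion above can be sketched on a small Series. The values are illustrative. The `errors="coerce"` option and the nullable `Int64` dtype are additions that the notebook itself does not use.

```python
import pandas as pd

# Year strings as str.extract would return them, with one missing value.
extracted = pd.Series(["1879", "1868", float("nan"), "1834"], dtype=object)

# Missing values force a floating-point result, which is why 1879 shows as 1879.0.
years = pd.to_numeric(extracted)
print(years.dtype)

# Optional: the nullable integer dtype keeps whole years and marks gaps as <NA>.
as_int = years.astype("Int64")
print(as_int.tolist())

# Optional: errors="coerce" turns unparseable text into NaN instead of raising.
coerced = pd.to_numeric(pd.Series(["1879", "n.d."]), errors="coerce")
print(coerced.tolist())

# Share of rows for which no year could be recovered.
print(years.isna().mean())
```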
