# An introduction to `pandas`


'Give someone a dataset, feed them for a day. Teach someone to create a dataset, feed them for a lifetime' -- me


## What is `pandas` and why does it matter?

`pandas` is a Python library that provides a couple of new data structures that give users much more flexibility when reading in or exporting data. The main way that we will interact with `pandas` is through its `DataFrame` object. A `DataFrame` is a very different data structure from the ones you may be used to from this class or other classes. It is most similar to a SQL table or an Excel spreadsheet in that it expects 'tabular data' (data that can be arranged in rows and columns).

`pandas` is unusual in that much of it is written not in pure Python but in `Cython`, a language that sits between Python and C and compiles to C. This means that `pandas` keeps the ease of use of a Python library while getting much of the speed and efficiency of C. As a result, `pandas` has become a critical component in data science, data analytics and, of course, natural language processing and the digital humanities.

In [1]:
import pandas as pd

You will almost always see `pandas` imported like this. The beginning is a normal import, and the `as` keyword tells Python that we intend to use the shorthand `pd` to stand in for `pandas`. Importantly, if we import it like this, we cannot use the name `pandas`. We can only use `pd`. See below:

In [2]:
pd

<module 'pandas' from '/Users/pnadel01/miniconda3/lib/python3.9/site-packages/pandas/__init__.py'>

In [3]:
pandas

NameError: name 'pandas' is not defined

We get a `NameError` because we told Python to import `pandas` as `pd`, so `pd` is the only name that works.

## Sample datasets for your final projects

For your final project, you'll need to use a new dataset. At your disposal are a couple of datasets that I have put together for you: https://github.com/pnadelofficial/datasets. They are in varying states of cleanness and ease of use in order to simulate a dataset you might find in the wild. Feel free to use any of these for your final project or to make one yourself.

Today, I'll walk you through using the `lyrics_app` and we'll also get a tour of `pandas` and experiment with data cleaning. 

In [4]:
!pip install wget
import wget
import os

# downloading lyrics app
if not os.path.isfile('lyrics.py'):
    wget.download('https://raw.githubusercontent.com/pnadelofficial/datasets/main/lyrics_app/lyrics.py')
    
# raise your hand if you get an error here...



This file takes in an artist name and an album name and returns a comma-separated values file, or `csv`, of all of the lyrics in that album. It has to be run from the command line, so let's see how we can do that from this notebook.

In [5]:
!python lyrics.py 'The Beatles' 'Revolver'
# this will save a file called 'The Beatles-Revolver_lyrics.csv' in my directory
# will take around a minute

Searching for "Revolver" by The Beatles...
Wrote Lyrics_Revolver.json.
Data saved
Data loaded


In [5]:
# read in as a DataFrame
revolver = pd.read_csv('The Beatles-Revolver_lyrics.csv')
revolver

Unnamed: 0,song_title,lyrics
0,Taxman,Taxman Lyrics[Intro: Paul McCartney & George H...
1,Eleanor Rig,"Eleanor Rigby Lyrics[Intro: Paul McCartney, Jo..."
2,I'm Only Sleeping,I’m Only Sleeping Lyrics[Verse 1]\nWhen I wake...
3,Love You To,Love You To Lyrics[Intro]\n\n[Verse 1]\nEach d...
4,"Here, There and Everywhere","Here, There and Everywhere Lyrics[Intro]\nTo l..."
5,Yellow Submarine,Yellow Submarine Lyrics[Verse 1: Ringo Starr]\...
6,She Said She Said,"She Said She Said Lyrics[Verse 1]\nShe said, ""..."
7,Good Day Sunshine,Good Day Sunshine Lyrics[Chorus]\nGood day sun...
8,And Your Bird Can Sing,And Your Bird Can Sing Lyrics[Verse 1]\nYou te...
9,For No One,"For No One Lyrics[Verse 1]\nYour day breaks, y..."
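
The `Unnamed: 0` column above is what `read_csv` produces when the first header field in the file is empty, which typically happens when a `DataFrame` was saved together with its row index. Assuming that is what happened here (the contents of `lyrics.py` are not shown), a minimal sketch with a made-up two-row CSV shows how `index_col=0` reads that column back in as the index instead:

```python
import io

import pandas as pd

# Made-up CSV text whose header starts with an empty field,
# the shape a saved row index leaves behind (an assumption about our file).
csv_text = ",song_title,lyrics\n0,Song A,la la la\n1,Song B,na na na\n"

default = pd.read_csv(io.StringIO(csv_text))
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default.columns))     # ['Unnamed: 0', 'song_title', 'lyrics']
print(list(with_index.columns))  # ['song_title', 'lyrics']
```

Either form works for the rest of this walkthrough, since we only use the `lyrics` column.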


In [6]:
type(revolver)

pandas.core.frame.DataFrame

In [9]:
revolver?

This is a `DataFrame`. As we see above, we can create one from a `csv` file, but we will also see how we can take a `dict` or `list` and convert it into a `DataFrame` in our next class.

Now let's take a look at what the output is. We can do that by first selecting the column we want to look at, in this case `lyrics`, and then selecting the row, in this case the first row (index = 0).

In `pandas`, there are two ways of selecting a column: 
* `revolver['lyrics']`
* `revolver.lyrics`

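Both return the same column (a `pandas` `Series`). The dot form is just a shortcut for the bracket form, and it only works when the column name is a valid Python name (no spaces, for example) and does not clash with an existing `DataFrame` attribute or method.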

To select a row, we put the row's index label in square brackets after the column. Because this `DataFrame` uses the default index (0, 1, 2, ...), the label is the same as the position, so it looks just like indexing a list. See below:

In [7]:
# selecting a column
revolver.lyrics # equivalent to revolver['lyrics']

0     Taxman Lyrics[Intro: Paul McCartney & George H...
1     Eleanor Rigby Lyrics[Intro: Paul McCartney, Jo...
2     I’m Only Sleeping Lyrics[Verse 1]\nWhen I wake...
3     Love You To Lyrics[Intro]\n\n[Verse 1]\nEach d...
4     Here, There and Everywhere Lyrics[Intro]\nTo l...
5     Yellow Submarine Lyrics[Verse 1: Ringo Starr]\...
6     She Said She Said Lyrics[Verse 1]\nShe said, "...
7     Good Day Sunshine Lyrics[Chorus]\nGood day sun...
8     And Your Bird Can Sing Lyrics[Verse 1]\nYou te...
9     For No One Lyrics[Verse 1]\nYour day breaks, y...
10    Doctor Robert Lyrics[Verse 1]\nRing my friend,...
11    I Want to Tell You Lyrics[Verse 1]\nI want to ...
12    Got to Get You into My Life Lyrics[Verse 1]\nI...
13    Tomorrow Never Knows Lyrics[Verse 1]\nTurn off...
Name: lyrics, dtype: object

In [8]:
# indexing just like in a list
revolver.lyrics[0]

"Taxman Lyrics[Intro: Paul McCartney & George Harrison]\nOne, two, three, four\nOne, two—  (One, two, three, four!)\n\n[Verse 1]\nLet me tell you how it will be\nThere's one for you, nineteen for me\nCause I'm the taxman\nYeah, I'm the taxman\n\n[Verse 2]\nShould five percent appear too small\nBe thankful, I don't take it all\nCause I'm the taxman\nYeah, I'm the taxman\n[Bridge]\n(If you drive a car, car) I'll tax the street\n(If you try to sit, sit) I'll tax your seat\n(If you get too cold, cold) I'll tax the heat\n(If you take a walk, walk) I'll tax your feet\nTaxman\n\n[Guitar Solo: Paul McCartney]\n\n[Refrain]\nCause I'm the taxman\nYeah, I'm the taxman\n\n[Verse 3]\nDon't ask me what I want it for\n(Haha, Mr. Wilson)\nIf you don't want to pay some more\n(Haha, Mr. Heath)\nCause I'm the taxman\nYeah, I'm the taxman\n\n[Verse 4]\nNow my advice for those who die\n(Taxman!)\nDeclare the pennies on your eyes\n(Taxman!)\nCause I'm the taxman\nYeah, I'm the taxman\nYou might also like[Ou
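
To make the two kinds of selection concrete, here is a small sketch on a made-up `DataFrame` (the name `toy` and its contents are invented for illustration). It checks that both column forms give the same result, and shows that square brackets on a column look up the index label, while `.iloc` looks up the position. The two only differ once the rows have been reordered.

```python
import pandas as pd

# A made-up DataFrame, standing in for `revolver`
toy = pd.DataFrame(
    {
        "song_title": ["Song A", "Song B", "Song C"],
        "lyrics": ["la la la", "na na na", "do re mi"],
    }
)

# Both column selections return an equal Series
print(toy["lyrics"].equals(toy.lyrics))  # True

# With the default index, label 0 and position 0 are the same row
first_by_label = toy.lyrics[0]
first_by_position = toy.lyrics.iloc[0]

# After sorting, the labels travel with their rows
shuffled = toy.sort_values("lyrics")
print(shuffled.lyrics.iloc[0])  # do re mi  (first row by position)
print(shuffled.lyrics[0])       # la la la  (row whose label is 0)
```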

This output looks pretty messy. In this case, there is a lot of text that indicates what's going on in the song itself, but all we want is the text of the lyrics. We're going to have to 'clean' it.

If you want to continue in computer science, data science or the digital humanities, data cleaning is something you will have to confront. No dataset is perfectly clean and prepared for the task you want to do with it. Instead, we can use `pandas` to manipulate the data into whatever form is convenient for us to work with.

In [10]:
taxman = revolver.lyrics[0] # make a variable for later

Below I'm going to use a package called `re`, short for regular expressions (or regex). Regular expressions are a very old pattern-matching language for text, which we can use in Python to search through and edit strings more easily. It may look intimidating now, but once you use it more, it will make a lot of sense and it will be clear why it's been around almost as long as computing itself.

Nothing is a substitute for learning regex, but I find it difficult to introduce to students because it has a steep learning curve. This website: https://www.autoregex.xyz/ is a great tool to get started with regex. It uses state-of-the-art sentence transformers (something we may learn about at the very end of this course) to convert an English prompt into a regular expression. Let's try it now for selecting all of the text between square brackets.

In [12]:
# let's get rid of all of the text between square brackets
import re 
re.sub(r'(\[.*\])','',taxman) # the r prefix makes a raw string, so Python leaves the backslashes for re

"Taxman Lyrics\nOne, two, three, four\nOne, two—  (One, two, three, four!)\n\n\nLet me tell you how it will be\nThere's one for you, nineteen for me\nCause I'm the taxman\nYeah, I'm the taxman\n\n\nShould five percent appear too small\nBe thankful, I don't take it all\nCause I'm the taxman\nYeah, I'm the taxman\n\n(If you drive a car, car) I'll tax the street\n(If you try to sit, sit) I'll tax your seat\n(If you get too cold, cold) I'll tax the heat\n(If you take a walk, walk) I'll tax your feet\nTaxman\n\n\n\n\nCause I'm the taxman\nYeah, I'm the taxman\n\n\nDon't ask me what I want it for\n(Haha, Mr. Wilson)\nIf you don't want to pay some more\n(Haha, Mr. Heath)\nCause I'm the taxman\nYeah, I'm the taxman\n\n\nNow my advice for those who die\n(Taxman!)\nDeclare the pennies on your eyes\n(Taxman!)\nCause I'm the taxman\nYeah, I'm the taxman\nYou might also like\nAnd you're working for no one but me (Taxman!)10Embed"

Let's break that down:
* `re.sub`: This is the substitute function in `re`. It works very similarly to `str.replace` in Python. It takes three arguments, described below.
* Argument 1: a pattern or what's sometimes called a 'regular expression'. This is just a string, but to the `re` package it means something more. Taking a closer look at this one:

`(`: opens capture group

`\[`: looks for the `[` character. The square bracket has a special meaning in regular expressions (it opens a set of characters to match), so we need to add the backslash to tell `re` that we are looking for a literal square bracket

`.*`: looks for any character (except a new line) zero or more times, matching as many characters as possible

`\]`: again we have to use the backslash to specify that we are looking for the end bracket character

`)`: ends capture group

* Argument 2: a replacement string, that is what do we want to substitute our pattern with
* Argument 3: the string we want to change
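
The breakdown above can be checked on a short made-up string (the `sample` text below is invented for illustration). This sketch shows two details of the pattern: `.` stops at a new line, so each match stays on one line, and `.*` is 'greedy', so a line with two bracketed parts loses everything between them. The variant `.*?` is offered only as a comparison. It is not the pattern used in this notebook.

```python
import re

# Made-up text with bracketed labels on several lines
sample = "Title Lyrics[Intro: A & B]\nOne, two\n[Verse 1]\nLine [aside] more [note] end"

# Greedy: on the last line, the match runs from the first [ to the last ]
greedy = re.sub(r"\[.*\]", "", sample)

# Non-greedy (lazy): each bracketed part is matched on its own
lazy = re.sub(r"\[.*?\]", "", sample)

print(repr(greedy))
print(repr(lazy))
```

On the lyrics in this notebook each bracketed label sits on its own line, so the greedy pattern behaves as intended.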

Another great regex tool is https://regex101.com/ which allows you to test different patterns to make sure that you are matching what you want to be matching. Both this site and the other one from above are great tools for getting started with regex, but no website or algorithm can replace the practice needed to master regex.

In [14]:
# what else? well, let's get rid of the 'Taxman Lyrics' at the beginning
# we see it's followed by a new line (\n), so we can split on that character
# this has the added benefit of removing all of the new line characters (\n)
re.sub(r'(\[.*\])','',taxman).split('\n')[1:]

['One, two, three, four',
 'One, two—  (One, two, three, four!)',
 '',
 '',
 'Let me tell you how it will be',
 "There's one for you, nineteen for me",
 "Cause I'm the taxman",
 "Yeah, I'm the taxman",
 '',
 '',
 'Should five percent appear too small',
 "Be thankful, I don't take it all",
 "Cause I'm the taxman",
 "Yeah, I'm the taxman",
 '',
 "(If you drive a car, car) I'll tax the street",
 "(If you try to sit, sit) I'll tax your seat",
 "(If you get too cold, cold) I'll tax the heat",
 "(If you take a walk, walk) I'll tax your feet",
 'Taxman',
 '',
 '',
 '',
 '',
 "Cause I'm the taxman",
 "Yeah, I'm the taxman",
 '',
 '',
 "Don't ask me what I want it for",
 '(Haha, Mr. Wilson)',
 "If you don't want to pay some more",
 '(Haha, Mr. Heath)',
 "Cause I'm the taxman",
 "Yeah, I'm the taxman",
 '',
 '',
 'Now my advice for those who die',
 '(Taxman!)',
 'Declare the pennies on your eyes',
 '(Taxman!)',
 "Cause I'm the taxman",
 "Yeah, I'm the taxman",
 'You might also like',
 "And you'r

In [12]:
# turn the list back into a string
taxman_clean = ' '.join(re.sub(r'(\[.*\])','',taxman).split('\n')[1:])
taxman_clean

"One, two, three, four One, two—  (One, two, three, four!)   Let me tell you how it will be There's one for you, nineteen for me Cause I'm the taxman Yeah, I'm the taxman   Should five percent appear too small Be thankful, I don't take it all Cause I'm the taxman Yeah, I'm the taxman  (If you drive a car, car) I'll tax the street (If you try to sit, sit) I'll tax your seat (If you get too cold, cold) I'll tax the heat (If you take a walk, walk) I'll tax your feet Taxman     Cause I'm the taxman Yeah, I'm the taxman   Don't ask me what I want it for (Haha, Mr. Wilson) If you don't want to pay some more (Haha, Mr. Heath) Cause I'm the taxman Yeah, I'm the taxman   Now my advice for those who die (Taxman!) Declare the pennies on your eyes (Taxman!) Cause I'm the taxman Yeah, I'm the taxman You might also like And you're working for no one but me (Taxman!)10Embed"
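
Notice the runs of several spaces in the joined string: every empty string in the list still gets a space on each side. If that matters for your project, one possible variation (a sketch on made-up text, not part of the cleaning used in this notebook) is to drop the empty strings before joining:

```python
# Made-up text in the same shape as the lyrics: a title line, then lines and blank lines
raw = "Song Lyrics\nFirst line\n\n\nSecond line\n\nThird line"

lines = raw.split("\n")[1:]  # drop the title line, as above

joined_as_is = " ".join(lines)
joined_no_blanks = " ".join(line for line in lines if line.strip())

print(repr(joined_as_is))      # 'First line   Second line  Third line'
print(repr(joined_no_blanks))  # 'First line Second line Third line'
```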

In [13]:
# the last thing is the weird '10Embed' thing
# let's make a new regular expression for that
re.sub(r'\d+Embed','',taxman_clean) # \d means any digit, + means one or more times

"One, two, three, four One, two—  (One, two, three, four!)   Let me tell you how it will be There's one for you, nineteen for me Cause I'm the taxman Yeah, I'm the taxman   Should five percent appear too small Be thankful, I don't take it all Cause I'm the taxman Yeah, I'm the taxman  (If you drive a car, car) I'll tax the street (If you try to sit, sit) I'll tax your seat (If you get too cold, cold) I'll tax the heat (If you take a walk, walk) I'll tax your feet Taxman     Cause I'm the taxman Yeah, I'm the taxman   Don't ask me what I want it for (Haha, Mr. Wilson) If you don't want to pay some more (Haha, Mr. Heath) Cause I'm the taxman Yeah, I'm the taxman   Now my advice for those who die (Taxman!) Declare the pennies on your eyes (Taxman!) Cause I'm the taxman Yeah, I'm the taxman You might also like And you're working for no one but me (Taxman!)"
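
To see what `\d+` does and does not match, here is a small sketch on made-up strings. It also tries an optional `$` at the end of the pattern, which only allows a match at the very end of the string. That anchor is an extra suggestion, not something the cell above uses.

```python
import re

tail = "no one but me (Taxman!)10Embed"

# + needs at least one digit before 'Embed'
print(re.sub(r"\d+Embed", "", tail))     # no one but me (Taxman!)
print(re.sub(r"\d+Embed", "", "Embed"))  # Embed  (unchanged: no digit to match)

# $ anchors the match to the end of the string
anchored = re.sub(r"\d+Embed$", "", tail)
mid_text = re.sub(r"\d+Embed$", "", "10Embed appears mid-text")

print(anchored)  # no one but me (Taxman!)
print(mid_text)  # 10Embed appears mid-text  (unchanged: not at the end)
```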

In [14]:
taxman_clean = re.sub(r'\d+Embed','',taxman_clean)
taxman_clean

"One, two, three, four One, two—  (One, two, three, four!)   Let me tell you how it will be There's one for you, nineteen for me Cause I'm the taxman Yeah, I'm the taxman   Should five percent appear too small Be thankful, I don't take it all Cause I'm the taxman Yeah, I'm the taxman  (If you drive a car, car) I'll tax the street (If you try to sit, sit) I'll tax your seat (If you get too cold, cold) I'll tax the heat (If you take a walk, walk) I'll tax your feet Taxman     Cause I'm the taxman Yeah, I'm the taxman   Don't ask me what I want it for (Haha, Mr. Wilson) If you don't want to pay some more (Haha, Mr. Heath) Cause I'm the taxman Yeah, I'm the taxman   Now my advice for those who die (Taxman!) Declare the pennies on your eyes (Taxman!) Cause I'm the taxman Yeah, I'm the taxman You might also like And you're working for no one but me (Taxman!)"

In [15]:
# now turn it into a function so we can try it out on other songs
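def clean_lyrics(lyrics):
    # remove the bracketed labels, then drop the first line ('<song title> Lyrics')
    clean = re.sub(r'(\[.*\])','',lyrics).split('\n')[1:]
    clean = ' '.join(clean) # turn the list back into a string
    clean = re.sub(r'\d+Embed','',clean) # remove the '10Embed'-style text at the end
    return clean.strip() # strip takes out any leading or ending spaces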

In [16]:
# works for Eleanor Rigby!
clean_lyrics(revolver.lyrics[1])

"Ah, look at all the lonely people! Ah, look at all the lonely people!  Eleanor Rigby Picks up the rice in the church where a wedding has been Lives in a dream Waits at the window Wearing the face that she keeps in a jar by the door Who is it for?  All the lonely people Where do they all come from? All the lonely people Where do they all belong?  Father McKenzie Writing the words of a sermon that no one will hear No one comes near Look at him working Darning his socks in the night when there's nobody there What does he care?   All the lonely people Where do they all come from? All the lonely people Where do they all belong?   Ah, look at all the lonely people! Ah, look at all the lonely people!  Eleanor Rigby Died in the church and was buried along with her name Nobody came Father McKenzie Wiping the dirt from his hands as he walks from the grave No one was saved You might also like All the lonely people (Ah, look at all the lonely people!) Where do they all come from? All the lonely p

In fact, this function will work for all of the songs in our `DataFrame`. And this is a very common pattern to confront in data cleaning: you'll have a column of unclean text and a function that will clean it. Thankfully, `pandas` gives us a method, called `apply`, that will apply this `clean_lyrics` function to every row in our `lyrics` column.

In [20]:
revolver.lyrics.apply(clean_lyrics) # apply takes a function as an argument

0     One, two, three, four One, two—  (One, two, th...
1     Ah, look at all the lonely people! Ah, look at...
2     When I wake up early in the morning Lift my he...
3     Each day just goes so fast I turn around, it's...
4     To lead a better life, I need my love to be he...
5     In the town where I was born Lived a man who s...
6     She said, "I know what it's like to be dead I ...
7     Good day sunshine Good day sunshine Good day s...
8     You tell me that you've got everything you wan...
9     Your day breaks, your mind aches You find that...
10    Ring my friend, I said you'd call Doctor Rober...
11    I want to tell you My head is filled with thin...
12    I was alone, I took a ride I didn't know what ...
13    Turn off your mind, relax and float downstream...
Name: lyrics, dtype: object
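
If `apply` feels like magic, it may help to see it next to a plain Python loop. In this sketch the function `shout` and the `DataFrame` `toy` are made up for illustration. `apply` calls the function once per value in the column and collects the results into a new `Series`:

```python
import pandas as pd


def shout(text):
    """Made-up stand-in for clean_lyrics: takes one string, returns one string."""
    return text.upper() + "!"


toy = pd.DataFrame({"lyrics": ["la la la", "na na na"]})

applied = toy["lyrics"].apply(shout)                  # pandas does the looping
by_hand = [shout(value) for value in toy["lyrics"]]   # the same work as a list comprehension

print(applied.tolist() == by_hand)  # True
```

Note that we pass the function itself (`shout`), not a call to it (`shout()`).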

In [23]:
revolver.lyrics.apply(clean_lyrics)[2]
# etc...

"When I wake up early in the morning Lift my head, I'm still yawning When I'm in the middle of a dream Stay in bed, float up stream (Float up stream)   Please, don't wake me, no, don't shake me Leave me where I am, I'm only sleeping   Everybody seems to think I'm lazy I don't mind, I think they're crazy Running everywhere at such a speed Till they find there's no need (There's no need)  Please, don't spoil my day, I'm miles away And after all, I'm only sleeping     Keeping an eye on the world going by my window Taking my time   Lying there and staring at the ceiling Waiting for a sleepy feeling    Please, don't spoil my day, I'm miles away And after all, I'm only sleeping   Yawn, Paul! *Yawns*   Keeping an eye on the world going by my window Taking my time You might also like When I wake up early in the morning Lift my head, I'm still yawning When I'm in the middle of a dream Stay in bed, float up stream (Float up stream)   Please, don't wake me, no, don't shake me Leave me where I am,

In [24]:
# you can overwrite a column in a DataFrame the same way you would for a variable
revolver['lyrics'] = revolver['lyrics'].apply(clean_lyrics)
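
Overwriting is fine here because we can always read the `csv` in again. If you would rather keep the raw text next to the cleaned text, one option is to assign to a new column name instead. This is a sketch on a made-up `DataFrame`. The column name `lyrics_clean` is hypothetical, and `str.strip` stands in for `clean_lyrics`.

```python
import pandas as pd

toy = pd.DataFrame({"lyrics": ["  la la la ", " na na na  "]})

# Assigning to a name that does not exist yet creates a new column
toy["lyrics_clean"] = toy["lyrics"].apply(str.strip)

print(list(toy.columns))  # ['lyrics', 'lyrics_clean']
```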

In [25]:
# our (mostly) clean DataFrame
revolver

Unnamed: 0,song_title,lyrics
0,Taxman,"One, two, three, four One, two— (One, two, th..."
1,Eleanor Rig,"Ah, look at all the lonely people! Ah, look at..."
2,I'm Only Sleeping,When I wake up early in the morning Lift my he...
3,Love You To,"Each day just goes so fast I turn around, it's..."
4,"Here, There and Everywhere","To lead a better life, I need my love to be he..."
5,Yellow Submarine,In the town where I was born Lived a man who s...
6,She Said She Said,"She said, ""I know what it's like to be dead I ..."
7,Good Day Sunshine,Good day sunshine Good day sunshine Good day s...
8,And Your Bird Can Sing,You tell me that you've got everything you wan...
9,For No One,"Your day breaks, your mind aches You find that..."


You may have to do more cleaning for your own project. How much or how little we clean is based on what we want to do with the text after it is clean. In the next few classes, we'll learn some advanced techniques in NLP, all of which require clean text as an input. This walkthrough should get you started on the fundamentals, but if you have any questions, specific or general, let me know!
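
As one example of further cleaning, the phrase 'You might also like' still appears in the middle of several cleaned songs, along with runs of extra spaces. The sketch below is one possible follow-up step, not a required part of the assignment. The function name `clean_lyrics_more` and the sample text are made up. Be aware that it would also delete the phrase if a song genuinely contained it.

```python
import re


def clean_lyrics_more(lyrics):
    """Hypothetical follow-up to clean_lyrics: removes leftover page text and extra spaces."""
    text = re.sub(r"You might also like", " ", lyrics)
    text = re.sub(r"\s+", " ", text)  # \s+ means one or more whitespace characters
    return text.strip()


sample = "Cause I'm the taxman   Yeah, I'm the taxman You might also like And you're working for no one but me"
print(clean_lyrics_more(sample))
```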

Learn more about `pandas`: https://pandas.pydata.org/docs/user_guide/index.html#user-guide

Learn more about `re`: https://docs.python.org/3/library/re.html, https://www.rexegg.com/regex-quickstart.html