# An introduction to `pandas`


'Give someone a dataset, feed them for a day. Teach someone to create a dataset, feed them for a lifetime' -- me


## What is `pandas` and why does it matter?

`pandas` is a Python library that provides a couple of new data structures that give users much more flexibility when reading in or exporting data. The main way that we will interact with `pandas` is through its `DataFrame` object. A `DataFrame` is a very different data structure from what you may be used to from this class or other classes. It is most similar to a table in a SQL database or an Excel spreadsheet in that it expects 'tabular data' (data that can be arranged in rows and columns).

`pandas` is unusual in that it is written not just in Python but also in `Cython`, a language that sits between Python and C. This means that `pandas` has the ease of use of a Python library while drawing on much of the speed and efficiency of C. As a result, `pandas` has become a critical component in data science, data analytics and, of course, natural language processing and the digital humanities.

In [1]:
import pandas as pd

You will almost always see `pandas` imported like this. The beginning looks like a normal import, but the `as` keyword tells Python that we intend to use the shorthand `pd` to stand in for `pandas`. Importantly, if we import it like this, the name `pandas` is not available; we can only use `pd`. See below:

In [2]:
pd

<module 'pandas' from 'C:\\Users\\Minnie\\anaconda3\\lib\\site-packages\\pandas\\__init__.py'>

In [3]:
pandas

NameError: name 'pandas' is not defined

We get a `NameError` because we told Python to import `pandas` as `pd`, so `pd` is the only name that works.

## Sample datasets for your final projects

For your final project, you'll need to use a new dataset. At your disposal are a couple of datasets that I have put together for you: https://github.com/pnadelofficial/datasets. They are in varying states of cleanness and ease of use in order to simulate a dataset you might find in the wild. Feel free to use any of these for your final project or to make one yourself.

Today, I'll walk you through using the `lyrics_app`. Along the way, we'll get a tour of `pandas` and experiment with data cleaning.

In [4]:
!pip install wget
import wget
import os

# downloading lyrics app
if not os.path.isfile('lyrics.py'):
    wget.download('https://raw.githubusercontent.com/pnadelofficial/datasets/main/lyrics_app/lyrics.py')
    
# raise your hand if you get an error here...

Collecting wget
  Downloading wget-3.2.zip (10 kB)
Building wheels for collected packages: wget
  Building wheel for wget (setup.py): started
  Building wheel for wget (setup.py): finished with status 'done'
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9675 sha256=4e03a94af52557c4af3912300f39cdd9972f4282ae8bb93849a09344d5e58419
  Stored in directory: c:\users\minnie\appdata\local\pip\cache\wheels\04\5f\3e\46cc37c5d698415694d83f607f833f83f0149e49b3af9d0f38
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


This script takes an artist name and an album name and returns a comma-separated values (`csv`) file of all of the lyrics on that album. It has to be run from the command line, so let's see how we can do that from this notebook.

In [5]:
!python lyrics.py 'The Beatles' 'Revolver'
# this will save a file called 'The Beatles-Revolver_lyrics.csv' in my directory
# will take around a minute

Collecting lyricsgenius
  Downloading lyricsgenius-3.0.1-py3-none-any.whl (59 kB)
Installing collected packages: lyricsgenius
Successfully installed lyricsgenius-3.0.1
Searching for "Beatles'" by 'The...
Wrote Lyrics_Beatles.json.
Data saved
Data loaded
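
Looking at the output above, the search ran for "Beatles'" by 'The, which suggests the two quoted arguments did not reach the script intact. One possible cause, assuming the notebook is running on Windows (as the file paths earlier suggest), is that the Windows command prompt does not treat single quotes as quoting characters. A minimal sketch of one way around this is to pass the arguments as a Python list through `subprocess.run`, so that no shell quoting is involved. To keep the sketch runnable without `lyrics.py`, the child process here is a one-line stand-in script that just prints the arguments it receives:

```python
import subprocess
import sys

# stand-in for lyrics.py (an assumption for illustration): prints the arguments it was given
echo_args = "import sys; print(sys.argv[1:])"

# each list item reaches the child process as exactly one argument, spaces included
result = subprocess.run(
    [sys.executable, "-c", echo_args, "The Beatles", "Revolver"],
    capture_output=True,
    text=True,
)
received = result.stdout.strip()
print(received)  # ['The Beatles', 'Revolver']
```

With the real script, the list would presumably be `[sys.executable, "lyrics.py", "The Beatles", "Revolver"]`. Using double quotes instead of single quotes on the `!python` line is another option worth trying.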


In [10]:
# read in as a DataFrame
revolver = pd.read_csv("'The-Beatles'_lyrics.csv")
revolver

| Unnamed: 0 | song_title | lyrics |
|---|---|---|
| 0 | I Saw Her Standing There | I Saw Her Standing There Lyrics[Intro: Paul Mc... |
| 1 | Misery | Misery Lyrics[Intro]\nThe world is treating me... |
| 2 | Anna (Go to Him) | Anna (Go to Him) Lyrics[Verse 1]\nAnna\nYou co... |
| 3 | Chains | Chains Lyrics[Chorus]\nChains, my baby's got m... |
| 4 | Boys | Boys Lyrics[Verse 1]\nI been told when a boy k... |
| ... | ... | ... |
| 216 | The Ballad of John and Yoko | The Ballad of John and Yoko Lyrics[Verse 1: Jo... |
| 217 | Old Brown Shoe | Old Brown Shoe Lyrics[Verse 1]\nI want a love ... |
| 218 | Across the Universe (Wildlife Version) | Across the Universe (Wildlife Version) Lyrics[... |
| 219 | Let It Be (Single Version) | Let It Be (Single Version) Lyrics[Verse 1]\nWh... |


In [11]:
type(revolver)

pandas.core.frame.DataFrame

In [12]:
revolver?

This is a `DataFrame`. As we see above, we can create one from a `csv` file, but we will also see how we can take a `dict` or `list` and convert it into a `DataFrame` in our next class.

Now let's take a look at what the output is. We can do that by first selecting the column we want to look at, in this case `lyrics`, and then selecting the row, in this case the first row (index = 0).

In `pandas`, there are two ways of selecting a column: 
* `revolver['lyrics']`
* `revolver.lyrics`

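Both return the same column as a `Series`. The dot form is a shortcut that only works when the column name is a valid Python name (no spaces, for example) and does not clash with an existing `DataFrame` attribute or method; the bracket form always works.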

To select a row, we put the index of the row we want in square brackets after the column, just like in a list (strictly speaking, `pandas` looks up the row's index label, which here is the same as its position). See below:

In [13]:
# selecting a column
revolver.lyrics # equivalent to revolver['lyrics']

0      I Saw Her Standing There Lyrics[Intro: Paul Mc...
1      Misery Lyrics[Intro]\nThe world is treating me...
2      Anna (Go to Him) Lyrics[Verse 1]\nAnna\nYou co...
3      Chains Lyrics[Chorus]\nChains, my baby's got m...
4      Boys Lyrics[Verse 1]\nI been told when a boy k...
                             ...                        
216    The Ballad of John and Yoko Lyrics[Verse 1: Jo...
217    Old Brown Shoe Lyrics[Verse 1]\nI want a love ...
218    Across the Universe (Wildlife Version) Lyrics[...
219    Let It Be (Single Version) Lyrics[Verse 1]\nWh...
220    You Know My Name (Look Up the Number) Lyrics[I...
Name: lyrics, Length: 221, dtype: object

In [14]:
# indexing just like in a list
revolver.lyrics[0]

'I Saw Her Standing There Lyrics[Intro: Paul McCartney]\nOne, two, three, four!\n[Verse 1: Paul McCartney, McCartney & John Lennon]\nWell, she was just seventeen and you know what I mean\nAnd the way she looked was way beyond compare\nSo how could I dance with another? (Oh)\nWhen I saw her standing there?\n\n[Verse 2: Paul McCartney, McCartney & John Lennon]\nWell, she looked at me, and I, I could see\nThat before too long, I\'d fall in love with her\nShe wouldn\'t dance with another (Woah)\nWhen I saw her standing there\n[Bridge: Paul McCartney & John Lennon]\nWell, my heart went "boom" when I crossed that room\nAnd I held her hand in mine\n\n[Verse 3: Paul McCartney, McCartney & John Lennon]\nWell, we danced through the night, and we held each other tight\nAnd before too long, I fell in love with her\nNow, I\'ll never dance with another (Woah)\nSince I saw her standing there\n\n[Guitar Solo: George Harrison]\n\n[Bridge: Paul McCartney & John Lennon]\nWell, my heart went "boom" when I
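
The two kinds of selection can be tried on a much smaller table. This is a minimal sketch with made-up rows (the `songs` DataFrame and its contents are assumptions for illustration, not the real lyrics data). It also shows `.iloc`, which selects strictly by position and so keeps working after rows have been filtered out:

```python
import pandas as pd

# a tiny stand-in for the lyrics DataFrame (made-up rows, not the real data)
songs = pd.DataFrame({
    "song_title": ["Song A", "Song B", "Song C"],
    "lyrics": ["la la la", "na na na", "do re mi"],
})

# bracket and attribute access return the same column
by_bracket = songs["lyrics"]
by_attribute = songs.lyrics
print(by_bracket.equals(by_attribute))

# [0] looks up the row whose index *label* is 0
print(songs.lyrics[0])

# after filtering, label 0 is gone, but .iloc[0] still gives the first remaining row
later = songs[songs.song_title != "Song A"]
print(later.lyrics.iloc[0])
```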

This output looks pretty messy. In this case, there is a lot of text that indicates what's going on in the song itself, but all we want is the text of the lyrics. We're going to have to 'clean' it.

If you want to continue in computer science, data science or the digital humanities, data cleaning is something you will have to confront. No dataset is perfectly clean and prepared for the task you want to do with it. Instead we can use `pandas` to manipulate the data into any form that is convenient for us to work with.  

In [15]:
taxman = revolver.lyrics[0] # make a variable for later

Below I'm going to use a package called `re`, short for regular expressions. Regular expressions are a very old pattern-matching language for text, which we can use in Python to navigate through strings more easily. It may look intimidating now, but once you use it more, it will make a lot of sense and it will be clear why it's been around almost as long as computing itself.

Nothing is a substitute for learning regex, but I find it difficult to introduce to students because it has a steep learning curve. This website, https://www.autoregex.xyz/, is a great tool to get started with regex. It uses state-of-the-art sentence transformers (something we may learn about at the very end of this course) to convert an English prompt into a regular expression. Let's try it now for selecting all of the text between square brackets.

In [16]:
# let's get rid of all of the text between square brackets
import re
# the r prefix makes this a raw string, so Python leaves the backslashes for re to interpret
re.sub(r'(\[.*\])','',taxman)

'I Saw Her Standing There Lyrics\nOne, two, three, four!\n\nWell, she was just seventeen and you know what I mean\nAnd the way she looked was way beyond compare\nSo how could I dance with another? (Oh)\nWhen I saw her standing there?\n\n\nWell, she looked at me, and I, I could see\nThat before too long, I\'d fall in love with her\nShe wouldn\'t dance with another (Woah)\nWhen I saw her standing there\n\nWell, my heart went "boom" when I crossed that room\nAnd I held her hand in mine\n\n\nWell, we danced through the night, and we held each other tight\nAnd before too long, I fell in love with her\nNow, I\'ll never dance with another (Woah)\nSince I saw her standing there\n\n\n\n\nWell, my heart went "boom" when I crossed that room\nAnd I held her hand in mine\n\n\nOh, we danced through the night, and we held each other tight\nAnd before too long, I fell in love with her\nNow I\'ll never dance with another (Oh)\nSince I saw her standing there\n\n\nOh, since I saw her standing there\nYeah

Let's break that down:
* `re.sub`: This is the substitute function in `re`. It works very similarly to `replace` in Python. It takes three arguments, which are described below.
* Argument 1: a pattern, or what's sometimes called a 'regular expression'. This is just a string, but to the `re` package it means something more. Taking a closer look at this one:
    * `(`: opens a capture group
    * `\[`: looks for the `[` character. Square brackets have a special meaning in regular expressions (they define a set of characters), so we add the backslash to tell `re` that we are looking for a literal square bracket
    * `.*`: looks for any character (except a new line) as many times as possible
    * `\]`: again, we use the backslash to specify that we are looking for the literal closing bracket character
    * `)`: closes the capture group
* Argument 2: a replacement string, that is, what we want to substitute for each match of our pattern
* Argument 3: the string we want to change
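
Two details of the `.*` piece are easy to miss, so here is a small sketch on made-up strings (the sample text is an assumption for illustration, not taken from the dataset). It shows that `.*` is 'greedy' and runs to the last closing bracket on a line, that adding `?` makes it stop at the first one, and that `.` does not cross a new line by default, which is why the pattern handles each line of the lyrics separately:

```python
import re

one_line = "[Verse 1] first line [Chorus] second line"

# greedy: .* runs to the LAST closing bracket on the line
greedy = re.sub(r"\[.*\]", "", one_line)
# lazy: .*? stops at the FIRST closing bracket, so each label is removed on its own
lazy = re.sub(r"\[.*?\]", "", one_line)
print(repr(greedy))  # ' second line'
print(repr(lazy))    # ' first line  second line'

# by default . does not match a new line, so a match never spans two lines
two_lines = "[Intro]\nOne, two\n[Verse 1]\nthree"
per_line = re.sub(r"\[.*\]", "", two_lines)
print(repr(per_line))  # '\nOne, two\n\nthree'
```

In the lyrics each bracketed label sits on its own line, so the greedy pattern is enough there; the lazy form is the safer choice if a line could contain more than one label.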

Another great regex tool is https://regex101.com/ which allows you to test different patterns to make sure that you are matching what you want to be matching. Both this site and the other one from above are great tools for getting started with regex, but no website or algorithm can replace the practice needed to master regex.

In [17]:
# what else? well, let's get rid of the 'I Saw Her Standing There Lyrics' header at the beginning
# we see it's followed by a new line (\n), so we can split on that character and drop the first piece
# this has the added benefit of removing all of the new line characters (\n)
re.sub(r'(\[.*\])','',taxman).split('\n')[1:]

['One, two, three, four!',
 '',
 'Well, she was just seventeen and you know what I mean',
 'And the way she looked was way beyond compare',
 'So how could I dance with another? (Oh)',
 'When I saw her standing there?',
 '',
 '',
 'Well, she looked at me, and I, I could see',
 "That before too long, I'd fall in love with her",
 "She wouldn't dance with another (Woah)",
 'When I saw her standing there',
 '',
 'Well, my heart went "boom" when I crossed that room',
 'And I held her hand in mine',
 '',
 '',
 'Well, we danced through the night, and we held each other tight',
 'And before too long, I fell in love with her',
 "Now, I'll never dance with another (Woah)",
 'Since I saw her standing there',
 '',
 '',
 '',
 '',
 'Well, my heart went "boom" when I crossed that room',
 'And I held her hand in mine',
 '',
 '',
 'Oh, we danced through the night, and we held each other tight',
 'And before too long, I fell in love with her',
 "Now I'll never dance with another (Oh)",
 'Since I saw her 

In [18]:
# turn the list back into a string
taxman_clean = ' '.join(re.sub(r'(\[.*\])','',taxman).split('\n')[1:])
taxman_clean

'One, two, three, four!  Well, she was just seventeen and you know what I mean And the way she looked was way beyond compare So how could I dance with another? (Oh) When I saw her standing there?   Well, she looked at me, and I, I could see That before too long, I\'d fall in love with her She wouldn\'t dance with another (Woah) When I saw her standing there  Well, my heart went "boom" when I crossed that room And I held her hand in mine   Well, we danced through the night, and we held each other tight And before too long, I fell in love with her Now, I\'ll never dance with another (Woah) Since I saw her standing there     Well, my heart went "boom" when I crossed that room And I held her hand in mine   Oh, we danced through the night, and we held each other tight And before too long, I fell in love with her Now I\'ll never dance with another (Oh) Since I saw her standing there   Oh, since I saw her standing there Yeah, well, since I saw her standing thereYou might also like21Embed'

In [19]:
# the last thing is the weird '21Embed' thing at the very end
# let's make a new regular expression for that
re.sub(r'\d+Embed','',taxman_clean) # \d means any digit, + means one or more times

'One, two, three, four!  Well, she was just seventeen and you know what I mean And the way she looked was way beyond compare So how could I dance with another? (Oh) When I saw her standing there?   Well, she looked at me, and I, I could see That before too long, I\'d fall in love with her She wouldn\'t dance with another (Woah) When I saw her standing there  Well, my heart went "boom" when I crossed that room And I held her hand in mine   Well, we danced through the night, and we held each other tight And before too long, I fell in love with her Now, I\'ll never dance with another (Woah) Since I saw her standing there     Well, my heart went "boom" when I crossed that room And I held her hand in mine   Oh, we danced through the night, and we held each other tight And before too long, I fell in love with her Now I\'ll never dance with another (Oh) Since I saw her standing there   Oh, since I saw her standing there Yeah, well, since I saw her standing thereYou might also like'

In [20]:
taxman_clean = re.sub(r'\d+Embed','',taxman_clean)
taxman_clean

'One, two, three, four!  Well, she was just seventeen and you know what I mean And the way she looked was way beyond compare So how could I dance with another? (Oh) When I saw her standing there?   Well, she looked at me, and I, I could see That before too long, I\'d fall in love with her She wouldn\'t dance with another (Woah) When I saw her standing there  Well, my heart went "boom" when I crossed that room And I held her hand in mine   Well, we danced through the night, and we held each other tight And before too long, I fell in love with her Now, I\'ll never dance with another (Woah) Since I saw her standing there     Well, my heart went "boom" when I crossed that room And I held her hand in mine   Oh, we danced through the night, and we held each other tight And before too long, I fell in love with her Now I\'ll never dance with another (Oh) Since I saw her standing there   Oh, since I saw her standing there Yeah, well, since I saw her standing thereYou might also like'

In [21]:
# now turn it into a function so we can try it out on other songs
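def clean_lyrics(lyrics):
    # drop bracketed section labels, then drop the first line (the '<title> Lyrics' header)
    clean = re.sub(r'(\[.*\])','',lyrics).split('\n')[1:]
    clean = ' '.join(clean) # rejoin the remaining lines into one string
    clean = re.sub(r'\d+Embed','',clean) # remove the trailing '<number>Embed' text
    return clean.strip() # strip takes out any leading or ending spaces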

In [22]:
# works for the song in row 1 too!
clean_lyrics(revolver.lyrics[1])

"The world is treating me bad, misery   I'm the kind of guy Who never used to cry The world is treating me bad, misery   I've lost her now for sure I won't see her no more It's gonna be a drag, misery  I'll remember all the little things we've done Can't she see she'll always be the only one, only one  Send her back to me 'Cause everyone can see Without her I will be in misery  I'll remember all the little things we've done She’ll remember and she’ll miss her only one, lonely one  Send her back to me 'Cause everyone can see Without her I will be in misery (Oh oh oh) In misery (Ooh ooh) My misery (La la la la la la) MiseryYou might also like"

In fact, this function will work for all of the songs in our `DataFrame`. And this is a very common pattern to confront in data cleaning: you'll have a column of unclean text and a function that will clean it. Thankfully, `pandas` gives us a method, called `apply`, that will apply this `clean_lyrics` function to every row in our `lyrics` column.

In [23]:
revolver.lyrics.apply(clean_lyrics) # apply takes a function as an argument

TypeError: expected string or bytes-like object
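
The `TypeError` above says that `re.sub` received something that is not a string. A common cause, and only an assumption about this dataset, is a missing value in the `lyrics` column, which `pandas` represents as `NaN` rather than as text. A minimal sketch with made-up rows and a hypothetical `clean_text` helper (not part of the original notebook) shows one way to check for missing values and make the cleaning function skip them:

```python
import re
import pandas as pd

def clean_text(value):
    """Hypothetical helper: clean strings, turn anything else into an empty string."""
    if not isinstance(value, str):
        return ""
    return re.sub(r"\[.*\]", "", value).strip()

# made-up column with one missing value in the middle
lyrics = pd.Series(["[Intro] hello", None, "[Verse 1] world"])

# count the missing values before cleaning
n_missing = int(lyrics.isna().sum())
print(n_missing)

# apply calls clean_text once per row and collects the results in a new Series
cleaned = lyrics.apply(clean_text)
print(cleaned.tolist())
```

Another option is to drop the incomplete rows first with `dropna` and then apply the original function unchanged; which is better depends on whether the songs without lyrics matter for your project.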

In [23]:
revolver.lyrics.apply(clean_lyrics)[2]
# etc...

"When I wake up early in the morning Lift my head, I'm still yawning When I'm in the middle of a dream Stay in bed, float up stream (Float up stream)   Please, don't wake me, no, don't shake me Leave me where I am, I'm only sleeping   Everybody seems to think I'm lazy I don't mind, I think they're crazy Running everywhere at such a speed Till they find there's no need (There's no need)  Please, don't spoil my day, I'm miles away And after all, I'm only sleeping     Keeping an eye on the world going by my window Taking my time   Lying there and staring at the ceiling Waiting for a sleepy feeling    Please, don't spoil my day, I'm miles away And after all, I'm only sleeping   Yawn, Paul! *Yawns*   Keeping an eye on the world going by my window Taking my time You might also like When I wake up early in the morning Lift my head, I'm still yawning When I'm in the middle of a dream Stay in bed, float up stream (Float up stream)   Please, don't wake me, no, don't shake me Leave me where I am,

In [24]:
# you can overwrite a column in a DataFrame the same way you would for a variable
revolver['lyrics'] = revolver['lyrics'].apply(clean_lyrics)

In [25]:
# our (mostly) clean DataFrame
revolver

| Unnamed: 0 | song_title | lyrics |
|---|---|---|
| 0 | Taxman | One, two, three, four One, two— (One, two, th... |
| 1 | Eleanor Rigby | Ah, look at all the lonely people! Ah, look at... |
| 2 | I'm Only Sleeping | When I wake up early in the morning Lift my he... |
| 3 | Love You To | Each day just goes so fast I turn around, it's... |
| 4 | Here, There and Everywhere | To lead a better life, I need my love to be he... |
| 5 | Yellow Submarine | In the town where I was born Lived a man who s... |
| 6 | She Said She Said | She said, "I know what it's like to be dead I ... |
| 7 | Good Day Sunshine | Good day sunshine Good day sunshine Good day s... |
| 8 | And Your Bird Can Sing | You tell me that you've got everything you wan... |
| 9 | For No One | Your day breaks, your mind aches You find that... |


You may have to do more cleaning for your own project. How much or how little we clean depends on what we want to do with the text once it is clean. In the next few classes, we'll learn some advanced techniques in NLP, all of which require clean text as input. This walkthrough should get you started on the fundamentals, but if you have any questions, specific or general, let me know!
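
As one example of further cleaning, the outputs above still contain the page text 'You might also like' and runs of several spaces where lines were removed. A minimal sketch of an extra step, using a hypothetical `tidy_more` function and a shortened sample string (both are illustrations, not part of the original notebook):

```python
import re

def tidy_more(text):
    """Hypothetical follow-up step: remove leftover page text and extra spaces."""
    text = text.replace("You might also like", " ")
    text = re.sub(r"\s+", " ", text)  # collapse any run of whitespace into one space
    return text.strip()

sample = "In misery (Ooh ooh)  My misery   MiseryYou might also like"
tidied = tidy_more(sample)
print(tidied)  # In misery (Ooh ooh) My misery Misery
```

Whether a step like this is worth adding depends on the analysis: for word counts the stray phrase would distort the results, while for simply reading the lyrics it may not matter.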

Learn more about `pandas`: https://pandas.pydata.org/docs/user_guide/index.html#user-guide

Learn more about `re`: https://docs.python.org/3/library/re.html, https://www.rexegg.com/regex-quickstart.html