<a href="https://colab.research.google.com/github/joshcova/NLP_Workshop/blob/main/02_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data wrangling in Pandas

In this workshop we will cover basic data manipulation tasks. To do this we will be using the popular `pandas` library.

A library is a collection of functions written by other Python users that allows us to work more efficiently.

Thankfully, the libraries that we will be working with are already installed in the Google Colab programming environment. If they were not, we would first need to install them (`pip install`).

After installing a library, you need to load it into your programming environment (`import`).

At its core, pandas introduces two primary data structures: `Series` (one-dimensional labeled arrays) and `DataFrame` (two-dimensional labeled data structures). For quantitative text analysis, `DataFrames` will be our workhorse, providing a flexible and efficient way to store, process, and analyze collections of text documents and their associated metadata.
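
To make the distinction concrete, here is a minimal sketch of both structures. The names `word_counts` and `docs` and all values are made up for illustration.

```python
import pandas as pd

# A Series: one-dimensional, every value carries an index label.
word_counts = pd.Series([5, 6, 6], index=["doc1", "doc2", "doc3"], name="n_words")

# A DataFrame: two-dimensional, all columns share the same index.
docs = pd.DataFrame({"n_words": word_counts, "year": [2020, 2020, 2021]})

# Selecting a single column of a DataFrame gives back a Series.
print(type(docs["n_words"]))
print(docs.shape)  # (number of rows, number of columns)
```

In practice a `DataFrame` can be thought of as a set of `Series` that are aligned on one common index.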

In [2]:
import pandas as pd

# Now we can use pandas' in-built functions.

In [None]:
# let us create some "dummy" data

In [None]:
data = {
    "doc_id": [1, 2, 3, 4],
    "author": ["Party A", "Party B", "Party A", "Party B"],
    "year": [2020, 2020, 2021, 2021],
    "text": [
        "The economy is growing rapidly.",
        "Immigration is a major political issue.",
        "The economy faces a serious crisis.",
        "A new climate policy was announced."
    ]
}

df = pd.DataFrame(data)
df

In [None]:
df["author"]

In [None]:
df.author
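
The two cells above select the same column. As a hedged sketch of when the two notations differ, the toy frame below (hypothetical column names, chosen only to show the problem) has one column name containing a space and one that clashes with an existing `DataFrame` attribute.

```python
import pandas as pd

toy = pd.DataFrame({
    "author": ["Party A", "Party B"],
    "word count": [5, 6],          # column name containing a space
    "shape": ["short", "short"],   # clashes with the DataFrame.shape attribute
})

# For a simple column name, both notations return the same Series.
same = toy["author"].equals(toy.author)

# Bracket notation works for any column name, including ones with spaces.
counts = toy["word count"]

# Dot notation finds the built-in attribute first, not the column.
attr = toy.shape     # a tuple with the frame's dimensions
col = toy["shape"]   # the actual column

print(same, attr, list(col))
```

Bracket notation is therefore the safer default; dot notation is a convenience for simple column names.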

In [6]:
# filtering by a condition: keep only the rows where year equals 2021

df_2021 = df[df["year"] == 2021]
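
Filters can also combine several conditions. The sketch below uses a small stand-in frame (`toy`, not the workshop data) to show three common patterns: `&` for combining conditions, `isin()` for membership, and `query()` as a string-based alternative.

```python
import pandas as pd

toy = pd.DataFrame({
    "author": ["Party A", "Party B", "Party A", "Party B"],
    "year": [2020, 2020, 2021, 2021],
})

# Each condition needs its own parentheses; & means "and", | means "or".
party_a_2021 = toy[(toy["year"] == 2021) & (toy["author"] == "Party A")]

# isin() keeps rows whose value appears in a list.
selected_years = toy[toy["year"].isin([2020, 2021])]

# query() expresses the same filter as a string.
party_a_2021_q = toy.query("year == 2021 and author == 'Party A'")

print(party_a_2021)
```

Python's plain `and` / `or` do not work element-wise on a `Series`, which is why `&` and `|` are used inside the brackets.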

In [9]:
# rename the "author" column to "party"; this returns a new DataFrame and leaves df unchanged
df_3 = df.rename(columns={"author": "party"})
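
A point that is easy to miss is that `rename` returns a new `DataFrame` rather than modifying the existing one. A minimal sketch with a stand-in frame (`toy` is illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({"author": ["Party A", "Party B"], "year": [2020, 2021]})

# rename() returns a renamed copy; the result has to be assigned to keep it.
renamed = toy.rename(columns={"author": "party"})

print(list(toy.columns))      # original column names are untouched
print(list(renamed.columns))  # renamed copy
```

The same holds for most pandas methods used in this notebook: assign the result to a name, otherwise the change is lost.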

## Reading in data

Typically, however, we will not be creating data from scratch; we will use data that is already structured in an appropriate format.

As a first dataset we will be using a slightly self-serving example, namely Cova and Germani (2025), [CommonsCorpus](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/KXDDJU): an annotated and machine-readable corpus of all UK House of Commons Parliamentary Debates (1970-2024).

In [2]:
import pandas as pd

In [3]:
# read in the dataframe (df)
df = pd.read_csv("https://raw.githubusercontent.com/joshcova/NLP_Workshop/refs/heads/main/data/brexit_data.csv")
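
`read_csv` has a few options worth knowing. The sketch below reads from an in-memory string instead of a URL so that it runs offline; the CSV content is invented, and the assumption that a file was saved together with its row index (which produces an `Unnamed: 0` column) is illustrative rather than a statement about how the workshop file was created.

```python
import io
import pandas as pd

# Hypothetical CSV text saved with its row index: the first header cell is empty.
csv_text = """,date,party,text
0,2019-01-15,Party A,First speech.
1,2019-03-29,Party B,Second speech.
"""

# Default behaviour: the nameless first column is read in as "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that column as the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

# parse_dates converts the listed columns to datetime while reading.
parsed = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=["date"])

print(list(raw.columns))
print(parsed.dtypes)
```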

In [None]:
# summary statistics, though not particularly helpful here

df["date"].describe()

In [None]:
# check the types of the different variables
df.dtypes

In [None]:
# tabulate by party
df["party"].value_counts()
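
`value_counts` can also report shares instead of raw counts. A small sketch on made-up party labels:

```python
import pandas as pd

parties = pd.Series(["Party A", "Party B", "Party A", "Party A"], name="party")

# Absolute frequencies, sorted from most to least common.
counts = parties.value_counts()

# normalize=True returns proportions that sum to 1.
shares = parties.value_counts(normalize=True)

print(counts)
print(shares)
```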

In [4]:
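# keep only speeches by the Conservative Party
df2 = df[df["party"] == "Conservative Party"]
# drop speeches whose text is an empty string; .copy() makes df2 an independent
# DataFrame, so that adding or changing columns later does not trigger a SettingWithCopyWarning
df2 = df2[df2["text"].str.len() != 0].copy()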
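
One caveat on length-based filtering, sketched on invented data: `str.len()` returns a missing value for missing text, and a missing value compared with `!= 0` evaluates to `True`, so such rows pass the filter. Whether the workshop data contains missing or whitespace-only texts is not checked here; the stricter filter below is one possible alternative, not the notebook's method.

```python
import pandas as pd

toy = pd.DataFrame({"text": ["A speech.", "", None, "   "]})

# Length of each text; the missing entry has a missing length.
lengths = toy["text"].str.len()

# Drops only the empty string; the missing and whitespace-only rows remain.
kept_by_len = toy[lengths != 0]

# Stricter alternative: require non-missing text that is non-empty after stripping spaces.
cleaned = toy[toy["text"].notna() & (toy["text"].str.strip() != "")]

print(len(kept_by_len), len(cleaned))
```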

In [None]:
df2.shape

In [7]:
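# filter by date: convert the column to datetime, then keep speeches dated after 1 January 2018
# (this keeps all later years as well, not only 2018)
df2["date"] = pd.to_datetime(df2["date"])
df_conservative_2018 = df2[df2["date"] > "2018-01-01"]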
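
Once a column has the datetime type, there are several ways to filter on it. The sketch below uses four invented dates to contrast an open-ended comparison with two ways of selecting a single calendar year.

```python
import pandas as pd

toy = pd.DataFrame({"date": ["2017-12-31", "2018-01-01", "2018-06-15", "2019-02-01"]})
toy["date"] = pd.to_datetime(toy["date"])

# Strictly after 1 January 2018: excludes that day itself, includes later years.
after_start = toy[toy["date"] > "2018-01-01"]

# Only the calendar year 2018, via the .dt accessor.
only_2018 = toy[toy["date"].dt.year == 2018]

# The same selection as an explicit range; between() includes both endpoints.
in_range = toy[toy["date"].between("2018-01-01", "2018-12-31")]

print(len(after_start), len(only_2018), len(in_range))
```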

In [None]:
df_conservative_2018.shape

In [9]:
# drop the leftover index column that was read in from the CSV file
df_conservative_2018 = df_conservative_2018.drop(columns=["Unnamed: 0"])