## Summary

### Loading the Data into a DataFrame
Before we can use our data, we need to load it and make sure it is in a form we can work with. There are many ways to do that; in this course, we will use a [pandas DataFrame](https://pandas.pydata.org/docs/user_guide/dsintro.html#dataframe).

We want to set up our DataFrame, called df, so that it has only one column, "text". Each row will contain one snippet of text.

In this demo, we walked through how to load the data from the 2022 Wikipedia dataset:

- Importing the pandas library
- Creating a DataFrame called df
- Adding the list of strings from the previous step to df as a column called "text"

Below is the version of this code used in the case study notebook:

        import pandas as pd

        # Load page text into a dataframe
        df = pd.DataFrame()
        df["text"] = response_dict["query"]["pages"][0]["extract"].split("\n")
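To see the shape of the result without calling the API, here is a minimal sketch that uses a hand-written stand-in for response_dict. The sample text is invented for illustration; the real response comes from the API call in the previous step and is much longer.

```python
import pandas as pd

# Hypothetical stand-in for the API response from the previous step.
# The nesting mirrors the keys used above; the text itself is made up.
response_dict = {
    "query": {
        "pages": [
            {
                "extract": (
                    "== January ==\n"
                    "January 1 – Example event.\n"
                    "\n"
                    "January 2\n"
                    "Another example event."
                )
            }
        ]
    }
}

# Same loading steps as above: one row per line of the extract
df = pd.DataFrame()
df["text"] = response_dict["query"]["pages"][0]["extract"].split("\n")

print(df.shape)
print(df)
```

Note that the empty line and the heading each become their own row at this stage, which is why the data still needs cleaning before use.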

### Wrangling the Data for Ingestion into the Model
Data from the API is much cleaner than raw website source code, but it still needs some cleanup before it is ready for our purposes.

In this demo, we walked through how to wrangle and clean the data in df:

- Addressing the problem of empty rows by subsetting to include only rows where the length is > 0
- Addressing the problem of headings by subsetting to exclude rows where the text starts with ==
- Addressing the problem of rows without dates using a date parser and somewhat more complex logic
  
Below is the version of this code used in the case study notebook:

        from dateutil.parser import parse

        # Clean up text to remove empty lines and headings
        df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))].copy()

        # In some cases dates are used as headings instead of being part of the
        # text sample; adjust so dated text samples start with dates
        prefix = ""
        for (i, row) in df.iterrows():
            # If the row already has " – ", it already has the needed date prefix
            if " – " not in row["text"]:
                try:
                    # If the row's text is a date, set it as the new prefix
                    parse(row["text"])
                    prefix = row["text"]
                except (ValueError, OverflowError):
                    # If the row's text isn't a date, add the prefix; write
                    # through df.at because row may be a copy of the data
                    df.at[i, "text"] = prefix + " – " + row["text"]
        df = df[df["text"].str.contains(" – ")].reset_index(drop=True)
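The three cleaning steps can be tried end to end on a small sample. The sketch below is not the notebook code: the sample rows are invented, and it assumes that writing through df.at is an acceptable way to update the DataFrame, since pandas does not guarantee that changes to a row yielded by iterrows reach the underlying data.

```python
import pandas as pd
from dateutil.parser import parse

# Invented sample rows that show all three problems:
# a heading, an empty row, and a date used as its own line
df = pd.DataFrame({"text": [
    "== January ==",
    "January 1 – Example event.",
    "",
    "January 2",
    "Another example event.",
]})

# Steps 1 and 2: drop empty rows and headings
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))].copy()

# Step 3: carry the most recent date forward as a prefix
prefix = ""
for i, row in df.iterrows():
    text = row["text"]
    if " – " in text:
        continue  # already has a date prefix
    try:
        parse(text)    # succeeds only if the text looks like a date
        prefix = text
    except (ValueError, OverflowError):
        # Not a date: prepend the last date seen, writing to df directly
        df.at[i, "text"] = prefix + " – " + text

# Keep only dated rows (this also drops the date-only rows)
df = df[df["text"].str.contains(" – ")].reset_index(drop=True)
print(df["text"].tolist())
```

With this sample, two rows remain: the one that already had a date, and the undated one with "January 2" prepended. The date-only row is dropped by the final filter because it never gains a " – " separator.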

## Additional References

[Pandas Documentation](https://pandas.pydata.org/docs/)