# How to Access the Labs and Data

Welcome to our first lab!

These labs are going to show you how to use computational linguistics to do the kinds of corpus analysis that we're talking about in *Computational Linguistics for Corpus Analysis*. Even beginners can go through these labs to see computational linguistics in action. If you want more detail, check out the *text_analytics* package that we're using: [https://github.com/jonathandunn/text_analytics](https://github.com/jonathandunn/text_analytics). You can use the package as a source for code examples and best practices. But you can also install the package via pip in order to use it directly:

        pip install textanalytics  #Last stable release
        pip install git+https://github.com/jonathandunn/text_analytics.git  #Most recent release

We're going to start by loading our dependencies. That just means that we start up all the packages that we'll need in order to do the lab. This is how we'll start each and every lab. 

So, start by **running** the line below. Click on the box and then press the "Run" button (it has the usual "play" icon). If you look at the top-right corner of the screen, you'll notice an empty circle next to "Python 3". While the code is running, that circle is filled in, so always wait until the code finishes before moving on. We'll also have the code print something like "Done!" every time it finishes.

In [1]:
from text_analytics import TextAnalytics
import os
import pandas as pd
print("Done!")

Done!


Excellent! Now we have our environment set up. So, let's load our package now that we've imported it. Here we're telling Python that the name *ai* refers to an instance of the *TextAnalytics* class from the *text_analytics* package. You can name it anything you'd like, but that's the convention that we will be using. After that, every time we use **ai.something** it is referring to the contents of the package. The second line stores the location of our data folder in *ai.data_dir*.

In [2]:
ai = TextAnalytics()
ai.data_dir = os.path.join(".", "data")
print("Done!")

Done!


In these labs we'll be using this *text_analytics* package to do our analysis. This makes it easier for those who don't have as much experience in Python. But, if you are comfortable with Python, the *text_analytics* package also provides examples for how to do everything that we cover in this book in more detail.

Now that we've loaded everything we need, let's open the data. Run the line below to set the location of our data. The filename is "economic.congress.1931-2016.gz" and this first line gives us the path to where we've stored the data. That path doesn't mean anything to you, but it tells the code notebook where to look. Then, we use the *pandas* package to open the data. This is a big corpus, so we only take the first 1,000 lines.

In [3]:
file = os.path.join(ai.data_dir, "economic.congress.1931-2016.gz")
df = pd.read_csv(file, nrows = 1000)
print(df)
print("Done!")

     Year  Month Chamber Party  \
0    1931      1       S     R   
1    1931      1       S     R   
2    1931      1       S     D   
3    1931      1       S     R   
4    1931      1       S     D   
..    ...    ...     ...   ...   
995  1931      1       S     D   
996  1931      1       S     D   
997  1931      1       S     D   
998  1931      1       S     D   
999  1931      1       S     D   

                                                  Text  
0    Mr. President. I desire to move at this time. ...  
1    Mr. President. in the nature of a memorial. I ...  
2    Mr. President. I introduced and had referred t...  
3    I ask unanimous consent to have printed in the...  
4    Mr. President. during the consideration of the...  
..                                                 ...  
995  From the Pawnee Agency. at Pawnee. Okla.. I re...  
996  I have a telegram from the Muskogee Agency. th...  
997  I call attention to a report submitted by the ...  
998  It has been in p
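
To see what `nrows` does without touching the real corpus, here is a minimal sketch that writes a tiny stand-in file and reads part of it back. The file name `demo.congress.gz` and the rows in it are made up for illustration; only the column names come from the output above.

```python
import gzip
import os
import tempfile

import pandas as pd

# Build a tiny stand-in corpus (hypothetical rows, not taken from the real data).
rows = [
    "Year,Month,Chamber,Party,Text",
    "1931,1,S,R,First example speech.",
    "1931,1,S,D,Second example speech.",
    "1931,2,H,D,Third example speech.",
]

tmp_dir = tempfile.mkdtemp()
demo_file = os.path.join(tmp_dir, "demo.congress.gz")

# Write the rows as gzip-compressed CSV text.
with gzip.open(demo_file, "wt", encoding="utf-8") as handle:
    handle.write("\n".join(rows) + "\n")

# pandas infers gzip compression from the ".gz" extension;
# nrows=2 stops after the first two data rows (the header is not counted).
demo_df = pd.read_csv(demo_file, nrows=2)
print(demo_df.shape)
print(list(demo_df.columns))
```

Reading only the first rows keeps the lab fast. Raising or removing `nrows` loads more of the file, at the cost of more memory and time.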

We're looking at congressional speeches from the US. The different columns give us information about each speech, like what year it was delivered. And the "Text" column gives us the actual data from each sample. Let's take a look at a random selection.

In [4]:
ai.print_sample(df)
print("\nDone!")


Done!


You'll probably notice how messy this looks! That's because we haven't done any cleaning or pre-processing.
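
This lab doesn't show how `ai.print_sample` works internally. As a rough pandas-only sketch of the same idea (not the package's actual implementation), you can draw a few random rows yourself and summarize one of the metadata columns. The small data frame below is a hypothetical stand-in that only reuses the column names from our corpus.

```python
import pandas as pd

# Hypothetical stand-in rows with the same columns as the congressional corpus.
demo_df = pd.DataFrame({
    "Year": [1931, 1931, 1931, 1931],
    "Month": [1, 1, 1, 1],
    "Chamber": ["S", "S", "S", "S"],
    "Party": ["R", "R", "D", "D"],
    "Text": ["Speech one.", "Speech two.", "Speech three.", "Speech four."],
})

# Draw two random rows; random_state makes the draw repeatable.
sample = demo_df.sample(n=2, random_state=0)

# Print the metadata and the full text of each sampled row.
for _, row in sample.iterrows():
    print(row["Year"], row["Chamber"], row["Party"])
    print(row["Text"])
    print()

# Count how many speeches each party contributes.
party_counts = demo_df["Party"].value_counts()
print(party_counts)
```

If you swap in the real `df` from above, the same calls should work, since they only rely on the "Party" and "Text" columns being present.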

And that's it for our first lab! Today we loaded our dependencies into the environment, created an instance of our *text_analytics* package, and looked at one corpus. Our labs in this course are going to be short and simple like this. You can always play around with the data by changing the code that is written in the cells. In fact, that's a great way to learn how it all works. And, remember, you can always get more details by looking at the *text_analytics* package directly.

We will be working with a few different types of corpora in this book. The first, part of our case-study on *Socio-Economic Indicators*, includes formal speeches to Congress and lead paragraphs for articles from *The New York Times*. Both of these corpora cover 1931 to 2016. The congressional corpus contains about 841 million words ("economic.congress.1931-2016.gz"). The news corpus contains 341 million words ("economic.nyt.1931-2016.gz").

The second kind of corpus we will be using is published books from the 19th century and early 20th century. We've divided this larger corpus into authors born 1800-1850 and those born 1851-1900. Altogether, this corpus contains about 1,042 million words. Here's one segment: "stylistics.authorship_1850.gz".

The third family of corpora we will work with is from digital sources: the web and Twitter. This data is geo-referenced, so we know what country or what city each sample is from. These corpora are about 841 million words in total: "sociolinguistics.english_cities.gz" and "sociolinguistics.english_dialects.gz".

These previous corpora represent many different registers, but they are also all English corpora. We get into multi-lingual data with a set of 39 languages with data from three sources: Wikipedia, Twitter, and the web. For each of these 39 languages we have the same amount of data from the same sources. This set of corpora is found in a separate folder: "\register\Register.ara.gz". Note that the language code is different for each language, but otherwise the naming convention is consistent.
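
Because the naming convention is consistent, the path for each language can be built from its code. The sketch below uses a hypothetical helper, `register_path`, and assumes that the "register" folder sits inside the same data folder as the other corpora. Only the code "ara" appears in this lab, so any other code you pass in is an assumption to check against the files you actually have.

```python
import os

# Same data folder that the lab assigns to ai.data_dir.
data_dir = os.path.join(".", "data")

def register_path(language_code, base_dir=data_dir):
    """Hypothetical helper: build the path to one language's register corpus."""
    filename = "Register." + language_code + ".gz"
    return os.path.join(base_dir, "register", filename)

# "ara" is the one language code named in the text above.
ara_path = register_path("ara")
print(ara_path)
```

Using `os.path.join` rather than typing slashes by hand keeps the path valid on Windows, macOS, and Linux.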

Now it's your turn! Use the code box below to load a different data set and print some samples from it.