# Creating a Sensemaking Data Pipeline with Airflow

**Mauricio Ferragut**


# Index
- [Abstract](#Abstract)
- [1. Introduction](#1.-Introduction)
- [2. Data Pipeline Tasks](#2.-Data-Pipeline-Tasks)
- [3. Airflow](#3.-Airflow)
- [4. Visualizing Word Frequency](#4.-Visualizing-Word-Frequency)
- [References](#References)

[Back to top](#Index)
## Abstract
The goal of this project is to make sense of unstructured MIT course catalog data downloaded from the university website by building a sensemaking data pipeline that counts word frequency in course titles. The project breaks the pipeline into discrete tasks and uses Airflow to automate each one, so that the tasks run end to end without human intervention. Once the pipeline completes, the word frequencies are visualized as word bubbles using the D3 library.


[Back to top](#Index)
## 1. Introduction

The data used for this project comes from the MIT course catalog. The pipeline is broken down into discrete tasks in a Python script, which defines a Directed Acyclic Graph (DAG) that Apache Airflow executes.


[Back to top](#Index)
## 2. Data Pipeline Tasks

The Python script for this DAG is in the same repository as this notebook, at 'airflow-docker/dags/assignment.py'. The following is a conceptual explanation of the discrete tasks that make up the pipeline.

### Task 0: Ensure beautifulsoup4 library is installed
This task runs the bash command 'pip install beautifulsoup4'. Task 3 depends on this library to parse the HTML data.

### Task 1: Pull Catalogs
The first step in the pipeline is to iterate through the list of course catalog URLs, download each catalog page from the MIT website, and write its contents to a file on the local machine. Each file is named after the URL it came from (e.g., m1a.html).
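A minimal sketch of this step, using only the standard library. The function names and the injectable `fetch` parameter are assumptions made for illustration and are not taken from 'assignment.py'.

```python
import os
import urllib.request


def catalog_filename(url):
    """Return the last path segment of the URL, e.g. 'm1a.html'."""
    return url.rstrip("/").rsplit("/", 1)[-1]


def fetch_bytes(url):
    """Download the raw bytes of one catalog page."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def pull_catalogs(urls, out_dir, fetch=fetch_bytes):
    """Download every URL and save it in out_dir under its own file name."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for url in urls:
        path = os.path.join(out_dir, catalog_filename(url))
        with open(path, "wb") as f:
            f.write(fetch(url))
        saved.append(path)
    return saved
```

Passing the downloader in as `fetch` is a design choice of this sketch: it lets the file-naming and writing logic be exercised without network access.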

### Task 2: Combine Unstructured Data
The second step in the pipeline is to combine all the unstructured data files from Task 1 into one large file by iterating through the files ending in .html and writing their contents into a single document.
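The combining step could look roughly like the sketch below. The function name and the output file name are hypothetical; the output is deliberately given a non-.html name here so that a rerun does not pick up the combined file as an input.

```python
import glob
import os


def combine_catalogs(src_dir, out_path):
    """Concatenate every .html file in src_dir into one file at out_path.

    Returns the number of files that were combined.
    """
    paths = sorted(glob.glob(os.path.join(src_dir, "*.html")))
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            # errors="replace" keeps the step running if a page is not valid UTF-8
            with open(path, encoding="utf-8", errors="replace") as f:
                out.write(f.read())
                out.write("\n")
    return len(paths)
```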

### Task 3: Parse Out Course Titles
The third step in the pipeline is to parse out the course titles from the HTML data. This is accomplished using the BeautifulSoup library to parse the HTML and find all content within \<h3> tags, which is where the course titles are located. Each title is appended to a 'titles' list, which is stored as a .json file.
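The project uses BeautifulSoup for this step. To stay runnable without third-party packages, the sketch below swaps in the standard library's `html.parser` to collect the text of every `<h3>` element; with BeautifulSoup, the same idea is usually written as `[h.get_text() for h in BeautifulSoup(html, "html.parser").find_all("h3")]`. The class and function names are assumptions for illustration.

```python
import json
from html.parser import HTMLParser


class H3Collector(HTMLParser):
    """Collect the text content of every <h3> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # None while the parser is outside an <h3>

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self._buffer is not None:
            # Join the text pieces and collapse runs of whitespace
            self.titles.append(" ".join("".join(self._buffer).split()))
            self._buffer = None


def extract_titles(html):
    """Return the list of <h3> texts found in an HTML string."""
    parser = H3Collector()
    parser.feed(html)
    parser.close()
    return parser.titles


def save_titles(titles, path):
    """Store the titles list as a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(titles, f)
```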

### Task 4: Clean the Course Titles
The fourth step in the pipeline is to remove all punctuation, numbers, one-character words, and common words such as 'and', 'of', 'to', 'in', and 'the' from the titles in the titles.json file. The cleaned titles are stored in a 'titles_clean.json' document.
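A sketch of the cleaning rules described above. The stop-word set contains only the five words named in the text (the project's actual list may be longer), and lowercasing the titles is an assumption of this sketch so that 'Design' and 'design' are treated as the same word.

```python
import re

# Only the five examples named in the text; the real list may be longer.
STOP_WORDS = {"and", "of", "to", "in", "the"}


def clean_title(title):
    """Strip punctuation, digits, one-character words, and stop words."""
    # Replace everything that is not an ASCII letter or whitespace with a space
    letters_only = re.sub(r"[^A-Za-z\s]", " ", title)
    words = letters_only.lower().split()  # lowercasing is an assumption
    return " ".join(w for w in words if len(w) > 1 and w not in STOP_WORDS)


def clean_titles(titles):
    """Clean every title and drop the ones that end up empty."""
    cleaned = (clean_title(t) for t in titles)
    return [t for t in cleaned if t]
```

For example, `clean_title("6.001 Structure and Interpretation of Computer Programs")` returns `"structure interpretation computer programs"`.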

### Task 5: Count Word Frequency
The fifth and final step in the pipeline is to count word frequencies. Once the frequencies are counted, the data is stored in a 'words.json' document, which contains each unique word and its associated frequency.
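Counting can be sketched with `collections.Counter`. The function names are hypothetical, and the JSON layout used here (one object mapping each word to its count) is an assumption; the visualizations may expect a different shape.

```python
import json
from collections import Counter


def count_words(titles):
    """Return a dict mapping each unique word to its frequency."""
    counts = Counter()
    for title in titles:
        counts.update(title.split())
    return dict(counts)


def save_word_counts(counts, path):
    """Store the word-to-frequency mapping as a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(counts, f)
```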

[Back to top](#Index)
## 3. Airflow

Airflow was run inside a Docker container using the docker-compose.yml found in the airflow-docker folder of the project. Once the Python script, which defines the order of the tasks for the DAG, was placed into the dags folder, the pipeline was ready to run. The UI for the completed pipeline can be seen below.

<img src="screenshots/AirflowTasks.png">
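Airflow derives the run order from the dependencies declared between tasks. The sketch below is not Airflow code: it uses the standard library's `graphlib` (Python 3.9+) to illustrate how a dependency graph resolves to an execution order. The task ids and the strictly linear chain are assumptions for illustration; the actual dependencies are defined in 'assignment.py'.

```python
from graphlib import TopologicalSorter

# Each key is a task; its value is the set of tasks that must finish first.
# Task ids and the linear chain are hypothetical, chosen to mirror Tasks 0-5.
dependencies = {
    "install_bs4": set(),
    "pull_catalogs": {"install_bs4"},
    "combine": {"pull_catalogs"},
    "parse_titles": {"combine"},
    "clean_titles": {"parse_titles"},
    "count_words": {"clean_titles"},
}

# static_order() yields every task after all of its prerequisites
run_order = list(TopologicalSorter(dependencies).static_order())
print(" >> ".join(run_order))
```

In an Airflow DAG file, a chain like this is commonly declared by joining the task objects with the `>>` operator.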

[Back to top](#Index)
## 4. Visualizing Word Frequency
Finally, the word frequency data produced by the pipeline is visualized in two ways. Both visualizations use the D3 JavaScript library and are located in the 'project-23/code_visualization' folder; open the .html files to view them.

Below are screenshots of each:

The first visualization is dynamic: hovering over a bubble shows the word associated with it, and the bubbles can be dragged around with the mouse.
<img src="screenshots/DynamicBubble.png">

The second visualization is static and shows the word associated with each bubble.
<img src="screenshots/WordBubble.png">

[Back to top](#Index)
## References
The course catalogs were downloaded from the following URLs:

http://student.mit.edu/catalog/m1a.html

http://student.mit.edu/catalog/m1b.html

http://student.mit.edu/catalog/m1c.html

http://student.mit.edu/catalog/m2a.html

http://student.mit.edu/catalog/m2b.html

http://student.mit.edu/catalog/m2c.html

http://student.mit.edu/catalog/m3a.html

http://student.mit.edu/catalog/m3b.html

http://student.mit.edu/catalog/m4a.html

http://student.mit.edu/catalog/m4b.html

http://student.mit.edu/catalog/m4c.html

http://student.mit.edu/catalog/m4d.html

http://student.mit.edu/catalog/m4e.html

http://student.mit.edu/catalog/m4f.html

http://student.mit.edu/catalog/m4g.html

http://student.mit.edu/catalog/m5a.html

http://student.mit.edu/catalog/m5b.html

http://student.mit.edu/catalog/m6a.html

http://student.mit.edu/catalog/m6b.html

http://student.mit.edu/catalog/m6c.html

http://student.mit.edu/catalog/m7a.html

http://student.mit.edu/catalog/m8a.html

http://student.mit.edu/catalog/m8b.html

http://student.mit.edu/catalog/m9a.html

http://student.mit.edu/catalog/m9b.html

http://student.mit.edu/catalog/m10a.html

http://student.mit.edu/catalog/m10b.html

http://student.mit.edu/catalog/m10c.html

http://student.mit.edu/catalog/m11a.html

http://student.mit.edu/catalog/m11b.html

http://student.mit.edu/catalog/m11c.html

http://student.mit.edu/catalog/m12a.html

http://student.mit.edu/catalog/m12b.html

http://student.mit.edu/catalog/m12c.html

http://student.mit.edu/catalog/m14a.html

http://student.mit.edu/catalog/m14b.html

http://student.mit.edu/catalog/m15a.html

http://student.mit.edu/catalog/m15b.html

http://student.mit.edu/catalog/m15c.html

http://student.mit.edu/catalog/m16a.html

http://student.mit.edu/catalog/m16b.html

http://student.mit.edu/catalog/m18a.html

http://student.mit.edu/catalog/m18b.html

http://student.mit.edu/catalog/m20a.html

http://student.mit.edu/catalog/m22a.html

http://student.mit.edu/catalog/m22b.html

http://student.mit.edu/catalog/m22c.html