# Data Engineering Capstone Project - Metro_Madrid <a class="anchor" id="top"></a>

## Project Summary
This notebook takes the output data from the [metro-big-data-unir](https://github.com/juananthony/metro-big-data-unir) project and creates a model for a data lake. The data consists of all Twitter mentions of the official account of the Metro de Madrid service.

[*Metro de Madrid*](https://www.metromadrid.es/) is the name of the tube/subway service that operates in Madrid, Spain. This service has 302 stations on 13 lines plus a light rail system called *Metro Ligero*. In 2019 the service was used more than 677 million times.

The project follows these steps:
* [Step 1: Scope the Project and Gather Data](#step-1)
* [Step 2: Explore and Assess the Data](#step-2)
* [Step 3: Define the Data Model](#step-3)
* [Step 4: Run ETL to Model the Data](#step-4)
* [Step 5: Project Write Up](#step-5)

***
### Set up the environment
First of all, a Spark session is needed to process all the data.

In [1]:
import os
from pyspark.sql import SparkSession

In [2]:
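# Memory settings are passed to the builder because they are read when the JVM starts;
# setting them on an already running session has no effect.
# -Xmx is omitted: Spark does not allow the maximum heap size in extraJavaOptions.
spark = SparkSession \
                .builder \
                .appName('udacity-capstone') \
                .master("local[*]") \
                .config('spark.driver.memory', '16g') \
                .config('spark.executor.memory', '16g') \
                .config('spark.executor.extraJavaOptions', '-XX:+UseParallelGC') \
                .getOrCreate()
# The session time zone is a runtime SQL setting, so it can be set once the session exists.
spark.conf.set('spark.sql.session.timeZone', 'CET')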

***
[Back to top](#top)
## Step 1: Scope the Project and Gather Data <a class="anchor" id="step-1"></a>

### Scope 
The [metro-big-data-unir](https://github.com/juananthony/metro-big-data-unir) project collects tweets that mention the [@metro_madrid](https://twitter.com/metro_madrid) account. Many of these messages are complaints about service quality (broken escalators or air conditioning, ...), while others report an issue in the service (long delays on a line, ...). That project uses NLP techniques to detect the messages that report an issue or complaint to [@metro_madrid](https://twitter.com/metro_madrid) and stores all of them in a MongoDB database.

This notebook uses a CSV export of the data stored in the MongoDB instance. All these messages were previously classified by [metro-big-data-unir](https://github.com/juananthony/metro-big-data-unir).

### Describe and Gather Data 

#### Dataset files
First, we define where the files are stored.

**Important**: The mentions file needs to be extracted from the ```mentions_20210210.7z``` archive, because the extracted file is larger than GitHub's 100 MB file size limit.

In [3]:
DATA_PATH = './data'
lines_file = 'lines.csv'
stations_file = 'stations.csv'
mentions_file = 'mentions_20210210.csv'

There are 3 files to use:
* ```lines_file```: It contains information about the lines offered by Metro de Madrid service.
* ```stations_file```: This file contains information about the stations.
* ```mentions_file```: It has all tweets where [@metro_madrid](https://twitter.com/metro_madrid) was mentioned.

#### Lines dataset
This dataset contains information about all lines in the Metro de Madrid service.

In [4]:
lines = spark.read.csv(os.path.join(DATA_PATH, lines_file), header=True).dropDuplicates()
lines.count()

16

In [5]:
lines.printSchema()

root
 |-- line: string (nullable = true)
 |-- regex: string (nullable = true)



There are 16 lines in the service. The dataset has 2 fields: the line name and a regular expression to search for that line in a text.

#### Stations dataset
This file contains information about all stations in the Metro de Madrid service.

In [6]:
stations = spark.read.csv(os.path.join(DATA_PATH, stations_file), header=True).dropDuplicates()
stations.count()

281

In [7]:
stations.printSchema()

root
 |-- station: string (nullable = true)
 |-- regex: string (nullable = true)



There are 281 stations in the dataset. It also has 2 fields: the station name and a regular expression to search for that station in a text.

#### Mentions dataset
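This file is a CSV export from MongoDB of all the previously stored tweets. To read that CSV, a schema is used to ensure that Spark reads each field with the proper data type. The ```multiLine``` and ```escape``` options are needed because tweet texts can contain line breaks and double quotes.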

In [8]:
from schemas.mentions_schema import mentions_schema
from pyspark import StorageLevel
mentions = spark \
                .read \
                .csv(os.path.join(DATA_PATH, mentions_file), header=True, multiLine=True, escape='"', schema=mentions_schema)
mentions.count()

955167
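The effect of the ```multiLine``` and ```escape='"'``` options can be illustrated without Spark. The following is a minimal sketch using Python's standard ```csv``` module on a made-up two-record export (the sample rows are hypothetical, not taken from the real file): a quoted tweet text that contains a line break and doubled quotes spans two physical lines but is still one logical record.

```python
import csv
import io

# Hypothetical two-record export: the first tweet text contains a line break
# and a doubled double quote, as a CSV export of free text can produce.
raw = (
    'id,text\n'
    '1,"Linea 6 parada.\nNadie informa ""nada"""\n'
    '2,"Todo bien"\n'
)

# Naive line splitting breaks the first record in two.
physical_lines = raw.splitlines()

# A CSV parser that understands quoting keeps each record whole.
rows = list(csv.DictReader(io.StringIO(raw)))

print(len(physical_lines) - 1)  # physical lines after the header: 3
print(len(rows))                # logical records: 2
print(rows[0]['text'])
```

Counting physical lines in the file would therefore overestimate the number of tweets, which is why the record count is taken from the parsed dataframe.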

In [9]:
mentions.groupBy('classification').count().show()

+--------------+------+
|classification| count|
+--------------+------+
|          null| 85465|
|       nothing|698591|
|     complaint|111612|
|         issue| 59499|
+--------------+------+



In [10]:
mentions.printSchema()

root
 |-- _id: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- text: string (nullable = true)
 |-- id: long (nullable = true)
 |-- in_reply_to_status_id: string (nullable = true)
 |-- user_id: long (nullable = true)
 |-- user_name: string (nullable = true)
 |-- user_screen_name: string (nullable = true)
 |-- user_description: string (nullable = true)
 |-- user_profile_image_url: string (nullable = true)
 |-- user_profile_image_url_https: string (nullable = true)
 |-- extended_tweet_full_text: string (nullable = true)
 |-- classification: string (nullable = true)



This dataset contains 955,167 records (902,865 once the duplicates are removed in Step 2), classified into 4 different values: ```null```, ```nothing```, ```complaint```, and ```issue```. It has the following fields:
* ```_id```: 
    * ```string``` value.
    * MongoDB document identifier.
* ```created_at```:
    * ```string``` value.
    * This field represents the exact time when the tweet was created. It is formatted using the ```E MMM d HH:m:s Z y``` pattern.
* ```text```:
    * ```string``` value.
    * Tweet text. This field is limited to 140 characters, so longer tweets are truncated. When that happens, the complete text is stored in another field: ```extended_tweet_full_text```.
* ```id```:
    * ```long``` value.
    * Tweet id. This id is a unique value for each tweet message.
* ```in_reply_to_status_id```:
    * ```string``` value.
    * In Twitter, users can post a tweet replying to another tweet. If a user replies to another one, the other tweet id is stored in this field.
* ```user_id```:
    * ```long``` value.
    * This id is a unique value for users. This value contains the id of the tweet author.
* ```user_name```:
    * ```string``` value.
    * Full name of the user.
* ```user_screen_name```:
    * ```string``` value.
    * This is a unique user nickname.
* ```user_description```:
    * ```string``` value.
    * This is the description that users write about themselves.
* ```user_profile_image_url```:
    * ```string``` value.
    * URL to get the user profile image.
* ```user_profile_image_url_https```:
    * ```string``` value.
    * URL to get the user profile image (using HTTPS).
* ```extended_tweet_full_text```:
    * ```string``` value.
    * If the ```text``` field is truncated because the tweet is longer than 140 characters, ```extended_tweet_full_text``` stores the entire tweet text.
* ```classification```:
    * ```string``` value.
    * This field stores the classification result by [metro-big-data-unir](https://github.com/juananthony/metro-big-data-unir).

***
[Back to top](#top)
## Step 2: Explore and Assess the Data <a class="anchor" id="step-2"></a>
### Explore the Data 
The goal of this step is to identify data quality issues, like missing values, duplicate data, etc.

#### Lines
Check whether any line row has a null value in ```line``` or ```regex```.

In [11]:
from pyspark.sql.functions import col
lines.filter(col('line').isNull() | col('regex').isNull()).count()

0

Check whether any station row contains null values in any column.

In [12]:
stations.filter(col('station').isNull() | col('regex').isNull()).count()

0

Now, we check whether there are any duplicates by line name.

In [13]:
lines.groupBy('line').count().filter(col('count') > 1).show()

+----+-----+
|line|count|
+----+-----+
+----+-----+



We also check whether there are any duplicates by station name.

In [14]:
stations.groupBy('station').count().filter(col('count') > 1).show()

+-------+-----+
|station|count|
+-------+-----+
+-------+-----+



Now, we check whether any mention is duplicated by tweet id.

In [15]:
from pyspark.sql.functions import desc
mentions.groupBy('id').count().filter(col('count')>1).orderBy(desc('count')).count()

52284

There are duplicate tweets: 52,284 tweet ids appear more than once. A tweet can be stored several times because different Twitter listeners were working at the same time. Those duplicates are removed in the next steps.

The next cell checks if there is any mention row with null values in the following fields:

In [16]:
mentions.filter(col('id').isNull()
                | col('text').isNull()
                | col('user_id').isNull()
                | col('user_screen_name').isNull()
                | col('created_at').isNull()).show()

+---+----------+----+---+---------------------+-------+---------+----------------+----------------+----------------------+----------------------------+------------------------+--------------+
|_id|created_at|text| id|in_reply_to_status_id|user_id|user_name|user_screen_name|user_description|user_profile_image_url|user_profile_image_url_https|extended_tweet_full_text|classification|
+---+----------+----+---+---------------------+-------+---------+----------------+----------------+----------------------+----------------------------+------------------------+--------------+
+---+----------+----+---+---------------------+-------+---------+----------------+----------------+----------------------+----------------------------+------------------------+--------------+



### Cleaning Steps
The data needs to be cleaned: duplicate rows are removed, useless data is filtered out and the datetime format is transformed.

First, we remove the duplicates in the mentions dataframe based on the ```id``` and ```text``` fields.

In [17]:
mentions = mentions \
                .dropDuplicates(['id', 'text']) \
                .persist(StorageLevel.MEMORY_ONLY_SER)

mentions.groupBy('id').count().filter(col('count')>1).orderBy(desc('count')).count()

0
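The duplicate check and the removal can be mirrored in plain Python to make the logic explicit. This is only an illustrative sketch on three made-up records (the ids and texts are hypothetical): the first part corresponds to ```groupBy('id').count().filter(count > 1)``` and the second to ```dropDuplicates(['id', 'text'])```.

```python
from collections import Counter

# Hypothetical records: tweet 101 was captured twice by two listeners.
records = [
    {'id': 101, 'text': 'Linea 6 parada'},
    {'id': 102, 'text': 'Ascensor averiado'},
    {'id': 101, 'text': 'Linea 6 parada'},
]

# Equivalent of groupBy('id').count().filter(col('count') > 1)
counts = Counter(r['id'] for r in records)
duplicated_ids = [tweet_id for tweet_id, n in counts.items() if n > 1]

# Equivalent of dropDuplicates(['id', 'text']): keep one record per (id, text) key.
unique = list({(r['id'], r['text']): r for r in records}.values())

print(duplicated_ids)  # [101]
print(len(unique))     # 2
```

Note that deduplicating on the pair (```id```, ```text```) would keep two rows for the same ```id``` if their texts differed. The check above returning 0 indicates that this case does not occur in this dataset.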

Now, we check how many records we have without duplicates:

In [18]:
mentions.count()

902865

Because we want a data lake about issue and complaint tweets, once the duplicates are removed we need to filter the data by classification value.

In [19]:
from pyspark.sql.functions import col, to_timestamp
classes = ['complaint', 'issue']
mentions = mentions.filter(col('classification').isin(classes))

We also need to create a new field called ```dt``` by parsing the ```created_at``` string into a timestamp. To do that conversion, we need to use the datetime format that ```created_at``` uses: ```E MMM d HH:m:s Z y```.

In [20]:
mentions = mentions.withColumn('dt', to_timestamp(mentions.created_at, 'E MMM d HH:m:s Z y').alias('dt'))
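To make the pattern concrete, here is a small plain-Python sketch that parses a value laid out as ```E MMM d HH:m:s Z y``` (weekday name, month name, day, time, UTC offset, year). The sample string is a hypothetical value written to match that layout, and ```strptime``` is used only as a stand-in for Spark's ```to_timestamp```.

```python
from datetime import datetime, timezone

# Hypothetical value laid out as: weekday, month, day, time, UTC offset, year.
created_at = 'Wed Feb 10 08:05:09 +0000 2021'

# %a ~ E, %b ~ MMM, %d ~ d, %H:%M:%S ~ HH:m:s, %z ~ Z, %Y ~ y
dt = datetime.strptime(created_at, '%a %b %d %H:%M:%S %z %Y')

print(dt.isoformat())  # 2021-02-10T08:05:09+00:00
```

Because the string carries its own UTC offset, the parsed value is an unambiguous instant. In the notebook, the ```spark.sql.session.timeZone``` setting then determines how that instant is displayed and split into date parts.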

Finally, because the ```text``` field can be truncated, a new column ```full_text``` is created. It takes the value of ```extended_tweet_full_text``` when that field is not null and the value of ```text``` otherwise.

In [21]:
from pyspark.sql.functions import when
# Plain column names are used: drop() compares names literally, so backticks can prevent a match.
mentions = mentions.withColumn('full_text', 
                               when(col('extended_tweet_full_text').isNotNull(), col('extended_tweet_full_text'))
                               .otherwise(col('text'))).drop('text', 'extended_tweet_full_text') \
                               .persist(StorageLevel.MEMORY_ONLY_SER)

***
[Back to top](#top)
## Step 3: Define the Data Model  <a class="anchor" id="step-3"></a>
### 3.1 Conceptual Data Model
The data we want to store is every message that reports an issue or complaint about a line or a station. That is the reason why the fact table is the inform fact, which can be a complaint or an issue. One tweet can report an issue that affects several lines (e.g., a closed station and all the lines that stop there). In other words, one tweet generates one or many "inform fact" records.

![fact-dimension diagram](./img/class_diagram.png "Fact-Dimension Diagram")

### 3.2 Mapping Out Data Pipelines
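The pipeline that loads the data into the chosen data model has the following steps:
* Load the lines, stations and mentions CSV files.
* Clean the mentions: remove duplicates, keep only ```complaint``` and ```issue``` tweets, parse ```created_at``` into ```dt``` and build ```full_text```.
* Build the line, station, class, date, user and tweet dimensions.
* Tag each tweet with the lines and stations that its text matches, then explode those tags to build the fact table.
* Persist every table as parquet files and run the data quality checks.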

***
[Back to top](#top)
## Step 4: Run ETL to Model the Data <a class="anchor" id="step-4"></a>

### 4.1 Create the data model
Build the data pipelines to create the data model.

In [22]:
OUTPUT_DIR = "./out"

#### Line Dimension
This dimension is based on the file with all lines of the Metro de Madrid service.

In [23]:
from pyspark.sql import Window
from pyspark.sql.functions import row_number
lines_w = Window.orderBy('line_name')
lines_dim = lines.withColumnRenamed('line','line_name').withColumn('line_id', row_number().over(lines_w))
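The surrogate key produced by ```row_number()``` over ```Window.orderBy('line_name')``` is simply the 1-based position of each row once the rows are sorted by name. A plain-Python sketch of the same idea, using a hypothetical subset of line names (the real ones come from ```lines.csv```):

```python
# Hypothetical subset of line names; the real ones come from lines.csv.
line_names = ['L6', 'L1', 'ML1', 'L10']

# row_number() over Window.orderBy('line_name') numbers the rows from 1,
# following the sort order of the key column.
line_ids = {name: i for i, name in enumerate(sorted(line_names), start=1)}

print(line_ids)  # {'L1': 1, 'L10': 2, 'L6': 3, 'ML1': 4}
```

Two consequences follow from this design. The ids depend on the sort order, so adding a new line and re-running the ETL can shift existing ids. Also, a window without ```partitionBy``` makes Spark move all rows to a single partition, which is acceptable for small dimensions like this one but does not scale to large tables.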

The lines dimension is persisted as parquet files in the ```lines``` folder inside ```OUTPUT_DIR```.

In [24]:
lines_file = os.path.join(OUTPUT_DIR, 'lines')
lines_dim.write.mode('overwrite').parquet(lines_file)

***
#### Station Dimension
The station dimension is based on the file with all stations of the Metro de Madrid service.

In [25]:
station_w = Window.orderBy('station_name')
stations_dim = stations.select(stations.station.alias('station_name'), 'regex').withColumn('station_id', row_number().over(station_w))

The stations dimension is persisted as parquet files in the ```stations``` folder inside ```OUTPUT_DIR```.

In [26]:
stations_file = os.path.join(OUTPUT_DIR, 'stations')
stations_dim.write.mode('overwrite').parquet(stations_file)

***
#### Class dimension
This dimension contains all possible incident classes that can be detected in tweets.

In [27]:
class_df = mentions.select(col('classification').alias('class_name')).distinct()
class_w = Window.orderBy('class_name')
class_dim = class_df.withColumn('class_id', row_number().over(class_w))

The class dimension is persisted as parquet files in the ```classes``` folder inside ```OUTPUT_DIR```.

In [28]:
class_file = os.path.join(OUTPUT_DIR, 'classes')
class_dim.write.mode('overwrite').parquet(class_file)

***
#### Date dimension
This dimension is based on all date entries in the tweet mentions. The ```dt``` timestamp (parsed from ```created_at```) is split into the following fields: ```year```, ```quarter```, ```month```, ```day```, ```weekday```, ```hour``` and ```minute```.

In [29]:
from pyspark.sql.functions import year, month, quarter, dayofweek, dayofmonth, hour, minute

date_w = Window.orderBy('year', 'month', 'day', 'hour', 'minute')

date_dim = mentions.select(
                        year(col('dt')).alias('year'),
                        quarter(col('dt')).alias('quarter'),
                        month(col('dt')).alias('month'),
                        dayofmonth(col('dt')).alias('day'),
                        dayofweek(col('dt')).alias('weekday'),
                        hour(col('dt')).alias('hour'),
                        minute(col('dt')).alias('minute')) \
                .dropDuplicates() \
                .withColumn('date_id', row_number().over(date_w))
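As a rough illustration of what each date dimension row contains, the following plain-Python sketch splits a timestamp into the same fields. The helper ```date_parts``` is hypothetical and exists only for this example; the ```weekday``` expression reproduces the Sunday=1 ... Saturday=7 convention of Spark's ```dayofweek```.

```python
from datetime import datetime

def date_parts(dt):
    """Split a datetime into the date dimension fields (illustrative helper)."""
    return {
        'year': dt.year,
        'quarter': (dt.month - 1) // 3 + 1,
        'month': dt.month,
        'day': dt.day,
        # isoweekday() is Monday=1 ... Sunday=7; shift it to Sunday=1 ... Saturday=7.
        'weekday': dt.isoweekday() % 7 + 1,
        'hour': dt.hour,
        'minute': dt.minute,
    }

parts = date_parts(datetime(2021, 2, 10, 8, 5, 9))
same_minute = date_parts(datetime(2021, 2, 10, 8, 5, 40))

print(parts)
print(parts == same_minute)  # True: seconds are not part of the dimension
```

Since seconds are dropped, all tweets created within the same minute map to the same row, which is why ```dropDuplicates()``` is applied before the ```date_id``` is assigned.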

The date dimension is persisted as parquet files in the ```date``` folder inside ```OUTPUT_DIR```.

In [30]:
date_file = os.path.join(OUTPUT_DIR, 'date')
date_dim.write.mode('overwrite').parquet(date_file)

***
#### User Dimension
This step selects the user columns and renames some of them:
* ```user_id``` (unchanged)
* ```user_name``` (unchanged)
* ```user_screen_name``` -> ```screen_name```
* ```user_description``` -> ```description```
* ```user_profile_image_url``` -> ```profile_image_url```
* ```user_profile_image_url_https``` -> ```profile_image_url_https```

In [31]:
user_dim = mentions.select(
            'user_id',
            'user_name',
            col('user_screen_name').alias('screen_name'),
            col('user_description').alias('description'),
            col('user_profile_image_url').alias('profile_image_url'),
            col('user_profile_image_url_https').alias('profile_image_url_https')
)

The user dimension is persisted as parquet files in the ```users``` folder inside ```OUTPUT_DIR```.

In [32]:
users_file = os.path.join(OUTPUT_DIR, 'users')
user_dim.write.mode('overwrite').parquet(users_file)

***
#### Tweet Dimension

To model this dimension, the mentions are joined to the ```date``` dimension by year, month, day, hour and minute to get the following fields:
* tweet_id
* date_id
* user_id
* text
* reply_tweet_id

In [33]:
from pyspark.sql.functions import substring

tweet_dim = mentions.join(date_dim,
              (year(mentions.dt) == date_dim.year) &
              (month(mentions.dt) == date_dim.month) &
              (dayofmonth(mentions.dt) == date_dim.day) &
              (hour(mentions.dt) == date_dim.hour) &
              (minute(mentions.dt) == date_dim.minute),
              'inner'
             ) \
        .select(col('id').alias('tweet_id'),
                'date_id',
                col('`user_id`').alias('user_id'),
                col('full_text').alias('text'),
                col('in_reply_to_status_id').alias('reply_tweet_id')
               )

The tweet dimension is persisted as parquet files in the ```tweets``` folder inside ```OUTPUT_DIR```.

In [34]:
tweets_file = os.path.join(OUTPUT_DIR, 'tweets')
tweet_dim.coalesce(2).write.mode('overwrite').parquet(tweets_file)

***
#### Fact table
Any station or line can be mentioned in the tweet text. So, we use the regular expressions included in the lines and stations datasets to search for those lines and stations in the tweet text.

First, we need to extract two dictionaries with the ids and regular expressions for lines and stations. Then, we search every line and station regex in each tweet text, creating two new columns with an array that contains the ids of the lines/stations mentioned in it.

<img src="./img/regex_web.png" style="height:450px;margin-left:auto;margin-right:auto;border:1px solid #888;"/>

First of all, 3 functions are defined:
* ```gen_tags()```. This function returns an array of tags. The tags are based on a dictionary of regular expressions and the given text.
    If a regex matches the given text, the key of that dictionary entry is appended to the tag array.
* ```gen_line_tags()```. This function returns a tag array with all line_id values found in the given text.
* ```gen_station_tags()```. This function returns a tag array with all station_id values found in the given text.

Once those functions are defined, they are used to create 2 ```udf```:
* ```gen_line_tags_udf```
* ```gen_station_tags_udf```

In [35]:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
import re

lines_dict = {elem.line_id: elem.regex for elem in lines_dim.collect()}
stations_dict = {elem.station_id: elem.regex for elem in stations_dim.collect()}

def gen_tags(text,dicc):
    """
    Returns an array of tags. The tags are based on the dictionary of regular expressions and the given text.
    If a regex matches the given text, the key of that dictionary entry is appended to the tag array.
    """
    tags = []
    for key, expr in dicc.items():
        if re.search(expr, text, re.IGNORECASE):
            tags.append(key)
    return tags

def gen_line_tags(text):
    """
    Returns a tag array with all line_id found in the given text.
    """
    return gen_tags(text, lines_dict)
    
def gen_station_tags(text):
    """
    Returns a tag array with all station_id found in the given text.
    """
    return gen_tags(text, stations_dict)

gen_line_tags_udf = udf(gen_line_tags, ArrayType(StringType()))
gen_station_tags_udf = udf(gen_station_tags, ArrayType(StringType()))

These ```udf``` functions are used to generate two new array columns: ```lines``` contains all line_id values found in the ```full_text``` column using ```gen_line_tags_udf```, and ```stations``` contains all station_id values found in ```full_text``` using ```gen_station_tags_udf```.

In [36]:
fact_aux = mentions.withColumn('lines', gen_line_tags_udf(col('full_text'))) \
                     .withColumn('stations', gen_station_tags_udf(col('full_text'))) \
                        .persist(StorageLevel.MEMORY_ONLY_SER)

Once the fact table has the two array columns, it is joined to the class dimension to get class_id. After that, an ```explode_outer``` is applied to each array column to get one row per array entry.

In [37]:
from pyspark.sql.functions import explode_outer

fact_w = Window.orderBy('id')

fact_df = fact_aux \
        .join(class_dim, col('classification') == class_dim.class_name, 'inner') \
        .withColumn('issue_id', row_number().over(fact_w)) \
        .select('issue_id', col('id').alias('tweet_id'), 'class_id', 'lines', 'stations') \
        .select('issue_id', 'tweet_id', 'class_id', explode_outer('lines').alias('line_id'), 'stations') \
        .select('issue_id', 'tweet_id', 'class_id', 'line_id', explode_outer('stations').alias('station_id'))
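The effect of chaining two ```explode_outer``` calls can be sketched in plain Python. The helpers below are hypothetical stand-ins written for this illustration, and the ids are made up: exploding ```lines``` and then ```stations``` yields one row per (line, station) combination, and an empty array still yields one row with a null, so tweets without any tag are not lost.

```python
from itertools import product

def explode_outer(values):
    """Mimic explode_outer: an empty or missing array yields a single None."""
    return values if values else [None]

def fact_rows(issue_id, tweet_id, class_id, lines, stations):
    """Build one row per (line, station) combination found in a tweet."""
    return [
        (issue_id, tweet_id, class_id, line_id, station_id)
        for line_id, station_id in product(explode_outer(lines), explode_outer(stations))
    ]

# A tweet that mentions two lines and one station produces two fact rows.
rows = fact_rows(1, 1001, 2, ['3', '7'], ['15'])
# A tweet without any line or station still produces one row with nulls.
no_tags = fact_rows(2, 1002, 1, [], [])

print(rows)     # [(1, 1001, 2, '3', '15'), (1, 1001, 2, '7', '15')]
print(no_tags)  # [(2, 1002, 1, None, None)]
```

Because ```issue_id``` is assigned before the arrays are exploded, all rows generated from one tweet share the same ```issue_id```. A tweet that matches several lines and several stations produces the full cross product of both arrays.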

The fact table is persisted as parquet files in the ```fact_table``` folder inside ```OUTPUT_DIR```.

In [38]:
fact_file = os.path.join(OUTPUT_DIR, 'fact_table')
fact_df.coalesce(4).write.mode('overwrite').parquet(fact_file)

### 4.2 Data Quality Checks

First, the persisted data is read back.

In [39]:
date_stored = spark.read.parquet(date_file)
user_stored = spark.read.parquet(users_file)
class_stored = spark.read.parquet(class_file)
tweet_stored = spark.read.parquet(tweets_file)
station_stored = spark.read.parquet(stations_file)
line_stored = spark.read.parquet(lines_file)
fact_stored = spark.read.parquet(fact_file)

The following step checks that no dimension has null ids.

In [46]:
print(f"# of null values in Date Dimension: {date_stored.filter(col('date_id').isNull()).count()}")
print(f"# of null values in User Dimension: {user_stored.filter(col('user_id').isNull()).count()}")
print(f"# of null values in Class Dimension: {class_stored.filter(col('class_id').isNull()).count()}")
print(f"# of null values in Tweet Dimension: {tweet_stored.filter(col('tweet_id').isNull() | col('user_id').isNull() | col('date_id').isNull()).count()}")
print(f"# of null values in Station Dimension: {station_stored.filter(col('station_id').isNull()).count()}")
print(f"# of null values in Line Dimension: {line_stored.filter(col('line_id').isNull()).count()}")

# of null values in Date Dimension: 0
# of null values in User Dimension: 0
# of null values in Class Dimension: 0
# of null values in Tweet Dimension: 0
# of null values in Station Dimension: 0
# of null values in Line Dimension: 0


All facts must contain ```issue_id```, ```tweet_id``` and ```class_id```.

In [45]:
fact_stored.filter(col('issue_id').isNull() | col('tweet_id').isNull() | col('class_id').isNull()).count()

0

The following cell is a unit test for the ```gen_tags``` function. It checks that the function behaves correctly.

In [43]:
# Raw strings keep the regex backslashes (\D) from being read as string escapes.
dicc_test = {
    'L1': r'l(i|í)neas? ?(^|\D)1(\D|$)| L ?(^|\D)1(\D|$)',
    'L6': r'l(i|í)neas? ?(^|\D)6(\D|$)| L ?(^|\D)6(\D|$)'
}
text_test = '@metro_madrid informar de que hay un hombre borracho, línea 6 dirección ciudad universitaria vagón s_8495 ocupa todo los asientos. gracias.'
expected = ['L6']
print(f"GIVEN \n\ta dictionary with regex expression: {dicc_test}\n\tand a given text: {text_test}\n\nWHEN\n\tgen_tags(\'{text_test}\', \'{dicc_test}\')\n\nTHEN\n\tresult should be equal to {expected}: {gen_tags(text_test, dicc_test) == expected}")

GIVEN 
	a dictionary with regex expression: {'L1': 'l(i|í)neas? ?(^|\\D)1(\\D|$)| L ?(^|\\D)1(\\D|$)', 'L6': 'l(i|í)neas? ?(^|\\D)6(\\D|$)| L ?(^|\\D)6(\\D|$)'}
	and a given text: @metro_madrid informar de que hay un hombre borracho, línea 6 dirección ciudad universitaria vagón s_8495 ocupa todo los asientos. gracias.

WHEN
	gen_tags('@metro_madrid informar de que hay un hombre borracho, línea 6 dirección ciudad universitaria vagón s_8495 ocupa todo los asientos. gracias.', '{'L1': 'l(i|í)neas? ?(^|\\D)1(\\D|$)| L ?(^|\\D)1(\\D|$)', 'L6': 'l(i|í)neas? ?(^|\\D)6(\\D|$)| L ?(^|\\D)6(\\D|$)'}')

THEN
	result should be equal to ['L6']: True
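The same check can be written with ```assert``` statements, which fail loudly instead of printing ```False```. The sketch below redefines ```gen_tags``` with the same matching logic so that it runs on its own, reuses the two regular expressions from the test above, and adds a few made-up sentences (assumptions chosen for this example) to cover a tweet with two lines, a number that must not match, and a tweet with no line at all.

```python
import re

def gen_tags(text, dicc):
    """Same matching logic as gen_tags above: keys whose regex matches the text."""
    return [key for key, expr in dicc.items() if re.search(expr, text, re.IGNORECASE)]

dicc_test = {
    'L1': r'l(i|í)neas? ?(^|\D)1(\D|$)| L ?(^|\D)1(\D|$)',
    'L6': r'l(i|í)neas? ?(^|\D)6(\D|$)| L ?(^|\D)6(\D|$)',
}

# Both lines are mentioned: both keys are returned, in dictionary order.
assert gen_tags('retraso en línea 1 y linea 6', dicc_test) == ['L1', 'L6']
# "16" must not be tagged as line 1 or as line 6.
assert gen_tags('línea 16 cortada', dicc_test) == []
# No line is mentioned at all.
assert gen_tags('ascensor averiado en Sol', dicc_test) == []
```

Negative cases such as the second one are worth keeping, because they exercise the ```(\D|$)``` guard that prevents a line number from matching inside a longer number.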


### 4.3 Data dictionary 

#### Dimension Tables

* **Line**
    * ```line_id```
        * ```Integer```
        * Line identifier.
    * ```line_name```
        * ```String```
        * Line name.
    * ```regex```
        * ```String```
        * Regex to search the line in a text.
* **Station**
    * ```station_id```
        * ```Integer```
        * Station identifier.
    * ```station_name```
        * ```String```
        * Station name.
    * ```regex```
        * ```String```
        * Regex to search the station in a text.
* **Class**
    * ```class_id```
        * ```Integer```
        * Class identifier.
    * ```class_name```
        * ```String```
        * Class name (```complaint``` or ```issue```).
* **User**
    * ```user_id```
        * ```Long```
        * User identifier.
    * ```user_name```
        * ```String```
        * User name.
    * ```screen_name```
        * ```String```
        * User unique string identifier.
    * ```description```
        * ```String```
        * User description.
    * ```profile_image_url```
        * ```String```
        * URL of profile image (HTTP protocol).
    * ```profile_image_url_https```
        * ```String```
        * URL of profile image (HTTPS protocol).
* **Date**
    * ```date_id```
        * ```Integer```
        * Date identifier.
    * ```year```
        * ```Integer```
        * Year number (i.e.: 2019, 2020, 2021, ...).
    * ```quarter```
        * ```Integer```
        * Quarter of the year (i.e.: 1, 2, ...).
    * ```month```
        * ```Integer```
        * Month of the year as integer (i.e.: 1, 2, 3, 4, ...).
    * ```weekday```
        * ```Integer```
        * Day of the week (Sunday=1, Monday=2, ..., Saturday=7).
    * ```day```
        * ```Integer```
        * Day of the month (1, 2, 3, ...).
    * ```hour```
        * ```Integer```
        * Hour in 24-hour format (i.e.: 0, 1, 2, ..., 12, 13, 14, ..., 22, 23).
    * ```minute```
        * ```Integer```
        * Minute (i.e.: 0, 1, 2, 3, 4, ..., 59)
* **Tweet**
    * ```tweet_id```
        * ```Long```
        * Tweet identifier.
    * ```date_id```
        * ```Integer```
        * Date id when the tweet was created.
    * ```user_id```
        * ```Long```
        * User id of the author of this tweet.
    * ```text```
        * ```String```
        * Tweet text.
    * ```reply_tweet_id```
        * ```String```
        * If this tweet is a reply, this field references the tweet_id of the tweet that it replies to.

#### Fact Table

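* **Inform**
    * ```issue_id```
        * ```Integer```
        * Inform identifier. All rows generated from the same tweet share this value.
    * ```tweet_id```
        * ```Long```
        * Tweet id that reports an issue or complaint.
    * ```line_id```
        * ```String```
        * Service line id (stored as a string because the tagging ```udf``` returns string arrays). Null if no line is mentioned.
    * ```station_id```
        * ```String```
        * Station id (stored as a string for the same reason). Null if no station is mentioned.
    * ```class_id```
        * ```Integer```
        * Class id that tells whether the tweet is a complaint or an issue.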

***
[Back to top](#top)
## Step 5: Project Write Up <a class="anchor" id="step-5"></a>

This project uses [Apache Spark](https://spark.apache.org/) to process large amounts of data. The dataset is not very large yet, but it grows every day. In the future, subway services from other cities could be integrated into the system, increasing the number of tweets to process.

If the number of records increased by 100x, more executors would need to be added. Also, to feed a dashboard with data updated daily or hourly, [Apache Airflow](https://airflow.apache.org/) would be needed to orchestrate the MongoDB extraction, the transformation and cleaning steps, etc.

Currently, the database is stored in parquet files. If it had to be accessed by 100+ people, the data would have to be migrated to another system (Redshift, HDFS, etc.).