Let's start with your project: 

Are you a data scientist? 

I think you are an awesome data scientist.

### **Problem** 
**Our goal is to create a predictive model that can answer the following question:**

**What kind of people had a better chance of surviving?**

**Data about passengers:**
*   Name
*   Age
*   Gender.


## Install and Import Libraries
Let's set up findspark and import PySpark:

In [1]:
import findspark
findspark.init()
from pyspark.sql import SparkSession

## Build Spark Session

In [2]:
spark = SparkSession.builder.getOrCreate()

## Data Loading


You have two datasets to read:
* Train
* Test.



In [3]:
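df_train = spark.read.csv('Data/train.csv', header=True, inferSchema=True)
# NOTE: this also reads train.csv, so df_test is a copy of df_train; the
# outputs later in this notebook (e.g. the union count of 1782) reflect that.
df_test = spark.read.csv('Data/train.csv', header=True, inferSchema=True)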

Let's work with the train dataset:

**Confirm that this is a DataFrame:**

In [4]:
df_train

DataFrame[PassengerId: int, Survived: int, Pclass: int, Name: string, Sex: string, Age: double, SibSp: int, Parch: int, Ticket: string, Fare: double, Cabin: string, Embarked: string]

**Show 5 rows.**

In [5]:
df_train.show(5, truncate=False)

+-----------+--------+------+---------------------------------------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|Name                                               |Sex   |Age |SibSp|Parch|Ticket          |Fare   |Cabin|Embarked|
+-----------+--------+------+---------------------------------------------------+------+----+-----+-----+----------------+-------+-----+--------+
|1          |0       |3     |Braund, Mr. Owen Harris                            |male  |22.0|1    |0    |A/5 21171       |7.25   |null |S       |
|2          |1       |1     |Cumings, Mrs. John Bradley (Florence Briggs Thayer)|female|38.0|1    |0    |PC 17599        |71.2833|C85  |C       |
|3          |1       |3     |Heikkinen, Miss. Laina                             |female|26.0|0    |0    |STON/O2. 3101282|7.925  |null |S       |
|4          |1       |1     |Futrelle, Mrs. Jacques Heath (Lily May Peel)       |female|35.0|1    |0    |113803          |53

**Display schema for the dataset:**

In [6]:
df_train.printSchema()

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)



**Statistical summary:**

In [7]:
df_train.describe().show(vertical=True)

-RECORD 0---------------------------
 summary     | count                
 PassengerId | 891                  
 Survived    | 891                  
 Pclass      | 891                  
 Name        | 891                  
 Sex         | 891                  
 Age         | 714                  
 SibSp       | 891                  
 Parch       | 891                  
 Ticket      | 891                  
 Fare        | 891                  
 Cabin       | 204                  
 Embarked    | 889                  
-RECORD 1---------------------------
 summary     | mean                 
 PassengerId | 446.0                
 Survived    | 0.3838383838383838   
 Pclass      | 2.308641975308642    
 Name        | null                 
 Sex         | null                 
 Age         | 29.69911764705882    
 SibSp       | 0.5230078563411896   
 Parch       | 0.38159371492704824  
 Ticket      | 260318.54916792738   
 Fare        | 32.2042079685746     
 Cabin       | null                 
 

## EDA - Exploratory Data Analysis

**Display count for the train dataset:**

In [8]:
df_train.count()

891

**Can you answer this question:** 

**How many people survived, and how many didn't survive?** 

**Please save data in a variable.**

In [9]:
import pyspark.sql.functions as F
survived = df_train.select("Survived").where(F.col("Survived") == 1).count()
not_survived = df_train.select("Survived").where(F.col("Survived") == 0).count()

**Display your result:**

In [10]:
print("survived: {}\ndidn't survive: {}".format(survived, not_survived))

survived: 342
didn't survive: 549


**Can you display your answer in ratio form? (Hint: Use UDF.)**

In [11]:
# Count the rows with a non-null label once and reuse it for both ratios
total = df_train.dropna(subset=["Survived"]).count()
survived_ratio = survived / total
not_survived_ratio = not_survived / total

In [12]:
print("survived ratio: {}\ndidn't survive ratio: {}".format(survived_ratio, not_survived_ratio))

survived ratio: 0.3838383838383838
didn't survive ratio: 0.6161616161616161


**Can you get the number of males and females?**


In [13]:
n_males = df_train.select("Sex").where(F.col("Sex") == 'male').count()
n_females = df_train.select("Sex").where(F.col("Sex") == 'female').count()
print("No. males: {}\nNo. females: {}".format(n_males, n_females))

No. males: 577
No. females: 314


**1. What is the survival rate for each gender (the average of the "Survived" column)?**

**2. What is the number of survivors of each gender?**

(Hint: Group by the "Sex" column.)

In [14]:
df_train.groupBy("Sex").agg(F.avg(df_train.Survived)).show()

+------+-------------------+
|   Sex|      avg(Survived)|
+------+-------------------+
|female| 0.7420382165605095|
|  male|0.18890814558058924|
+------+-------------------+
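
The average of a 0/1 column is the share of ones, which is why `F.avg` on `Survived` gives a survival rate, while summing the same column gives the number of survivors (question 2). Here is a minimal pure-Python sketch of that arithmetic. It uses a small made-up sample rather than the Titanic data, and the function name `survival_by_group` is hypothetical:

```python
from collections import defaultdict

# Hypothetical (sex, survived) pairs; NOT taken from the dataset.
rows = [
    ("female", 1), ("female", 1), ("female", 0),
    ("male", 0), ("male", 1), ("male", 0), ("male", 0),
]

def survival_by_group(rows):
    """Return {group: (n_survivors, survival_rate)} for 0/1 labels."""
    totals = defaultdict(int)
    survivors = defaultdict(int)
    for group, survived in rows:
        totals[group] += 1
        survivors[group] += survived  # summing 0/1 labels counts the ones
    return {g: (survivors[g], survivors[g] / totals[g]) for g in totals}

stats = survival_by_group(rows)
for group, (n, rate) in stats.items():
    print(f"{group}: {n} survivors, rate {rate:.3f}")
```

In PySpark, both numbers per group can be obtained by placing `F.sum("Survived")` next to `F.avg("Survived")` inside the same `agg` call.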



**Create a temporary view in PySpark:**

In [15]:
df_train.createOrReplaceTempView("people")
spark.sql("select * from people").show(10)

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
|          6|       0|     3|    Moran, Mr. James|  male|null|    0|    0|      

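**How many people survived, and how many didn't survive? Using SQL (the query below counts the passengers who did not survive, `Survived == 0`, despite the `n_survived` alias):**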

In [16]:
spark.sql("SELECT COUNT(Survived) as n_survived FROM people WHERE Survived == 0 ").show()

+----------+
|n_survived|
+----------+
|       549|
+----------+



**Can you display the number of survivors from each gender as a ratio?**

(Hint: Group by "sex" column.)

**Can you do this via SQL?**

In [17]:
spark.sql("SELECT Sex, AVG(Survived) FROM people GROUP BY Sex").show()

+------+-------------------+
|   Sex|      avg(Survived)|
+------+-------------------+
|female| 0.7420382165605095|
|  male|0.18890814558058924|
+------+-------------------+



**Display the survival ratio for each passenger class (Pclass):**


In [18]:
spark.sql("SELECT Pclass, ROUND(AVG(Survived),3) as Pclass_ratio FROM people GROUP BY Pclass").show()

+------+------------+
|Pclass|Pclass_ratio|
+------+------------+
|     1|        0.63|
|     3|       0.242|
|     2|       0.473|
+------+------------+



**Let's take a break and continue after this.**

## Data Cleaning

**First and foremost, we must merge both the train and test datasets. (Hint: The union function can do this.)**



In [19]:
union_df = df_train.union(df_test)

**Display count:**

In [20]:
union_df.count()

1782

**Create a temporary view for the merged data:**

In [21]:
union_df.createOrReplaceTempView("all_people")
spark.sql("SELECT * FROM all_people").show(5)

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+

**Can you find the number of null values in each column?**


In [22]:
# For each column, count the rows where the value is null
union_df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c)
                 for c in union_df.columns]).show()

+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
|PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
|          0|       0|     0|   0|  0|354|    0|    0|     0|   0| 1374|       4|
+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+



**Create Dataframe for null values**

1. Column
2. Number of missing values.

In [23]:
null_df = union_df.select([F.count(F.when(F.isnan(c) | F.col(c).isNull(), c)).alias(c)
                           for c in union_df.columns])
null_df.show()

+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
|PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
|          0|       0|     0|   0|  0|354|    0|    0|     0|   0| 1374|       4|
+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
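
The result above is one wide row, whereas the task asks for two fields: the column name and its number of missing values. One possible way to reshape it is sketched below in plain Python, with the counts copied from the output above. The helper name `to_long_format` is hypothetical:

```python
# Null counts copied from the output above.
null_counts = {
    "PassengerId": 0, "Survived": 0, "Pclass": 0, "Name": 0,
    "Sex": 0, "Age": 354, "SibSp": 0, "Parch": 0,
    "Ticket": 0, "Fare": 0, "Cabin": 1374, "Embarked": 4,
}

def to_long_format(counts):
    """Turn {column: n_missing} into (column, n_missing) rows.

    Keeps only columns that have missing values, largest count first.
    """
    rows = [(name, n) for name, n in counts.items() if n > 0]
    return sorted(rows, key=lambda row: row[1], reverse=True)

missing_rows = to_long_format(null_counts)
for name, n in missing_rows:
    print(f"{name}: {n}")
```

Assuming the notebook's `null_df` and `spark` objects, the dictionary could come from `null_df.first().asDict()`, and the rows could be passed to `spark.createDataFrame(missing_rows, ["Column", "MissingValues"])`. Those two column names are illustrative only.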



## Preprocessing 

**Can you show me the name column from your temporary table?**

In [24]:
spark.sql("SELECT Name FROM all_people").show(truncate=False)

+-------------------------------------------------------+
|Name                                                   |
+-------------------------------------------------------+
|Braund, Mr. Owen Harris                                |
|Cumings, Mrs. John Bradley (Florence Briggs Thayer)    |
|Heikkinen, Miss. Laina                                 |
|Futrelle, Mrs. Jacques Heath (Lily May Peel)           |
|Allen, Mr. William Henry                               |
|Moran, Mr. James                                       |
|McCarthy, Mr. Timothy J                                |
|Palsson, Master. Gosta Leonard                         |
|Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)      |
|Nasser, Mrs. Nicholas (Adele Achem)                    |
|Sandstrom, Miss. Marguerite Rut                        |
|Bonnell, Miss. Elizabeth                               |
|Saundercock, Mr. William Henry                         |
|Andersson, Mr. Anders Johan                            |
|Vestrom, Miss

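**Run this code to extract each passenger's title (the letters right before a period) from the "Name" column:**

In [25]:
# Raw string, so that \. reaches the regex engine as an escaped period
combined = union_df.withColumn('Title', F.regexp_extract(F.col("Name"), r"([A-Za-z]+)\.", 1))
combined.createOrReplaceTempView('combined')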
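
To see what this pattern picks out of a name, here is a small sketch that uses Python's `re` module as a stand-in for Spark's regex engine. For a pattern this simple the two should agree, but that is an assumption. The helper `extract_title` is hypothetical, and the names are taken from the rows shown earlier:

```python
import re

# Same pattern as in the cell above: a run of letters followed by a period.
TITLE_PATTERN = re.compile(r"([A-Za-z]+)\.")

def extract_title(name):
    """Return the first run of letters followed by a period, or "" if none."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else ""

names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
]
titles = [extract_title(n) for n in names]
print(titles)
```

The surname is skipped because it is followed by a comma rather than a period, so the first match is the title.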

**Display the "Title" column and count its values:**

In [26]:
spark.sql("SELECT Title FROM combined").show()
spark.sql("SELECT COUNT(Title) FROM combined").show()

+------+
| Title|
+------+
|    Mr|
|   Mrs|
|  Miss|
|   Mrs|
|    Mr|
|    Mr|
|    Mr|
|Master|
|   Mrs|
|   Mrs|
|  Miss|
|  Miss|
|    Mr|
|    Mr|
|  Miss|
|   Mrs|
|Master|
|    Mr|
|   Mrs|
|   Mrs|
+------+
only showing top 20 rows

+------------+
|count(Title)|
+------------+
|        1782|
+------------+



In [27]:
spark.sql("SELECT DISTINCT(Title) FROM combined").show()

+--------+
|   Title|
+--------+
|     Don|
|    Miss|
|Countess|
|     Col|
|     Rev|
|    Lady|
|  Master|
|     Mme|
|    Capt|
|      Mr|
|      Dr|
|     Mrs|
|     Sir|
|Jonkheer|
|    Mlle|
|   Major|
|      Ms|
+--------+



**We can see that Dr, Rev, Major, Col, Mlle, Capt, Don, Jonkheer, Countess, Ms, Sir, Lady, and Mme are really rare titles, so create a dictionary that maps each of them to "rare" and keeps the common titles unchanged.**

In [28]:
mapping={'Dr':'rare','Rev':'rare','Major':'rare','Col':'rare','Mlle':'rare','Capt':'rare',
        'Don':'rare','Jonkheer':'rare','Countess':'rare','Ms':'rare','Sir':'rare','Lady':'rare','Mme':'rare','Mr':'Mr'
        ,'Mrs':'Mrs','Miss':'Miss','Master':'Master'}
combined_title = combined.replace(to_replace=mapping, subset=['Title'])
combined_title.select("Title").show()

+------+
| Title|
+------+
|    Mr|
|   Mrs|
|  Miss|
|   Mrs|
|    Mr|
|    Mr|
|    Mr|
|Master|
|   Mrs|
|   Mrs|
|  Miss|
|  Miss|
|    Mr|
|    Mr|
|  Miss|
|   Mrs|
|Master|
|    Mr|
|   Mrs|
|   Mrs|
+------+
only showing top 20 rows



In [29]:
print(combined_title.select("Title").distinct().count())
combined_title.select("Title").distinct().show()

5
+------+
| Title|
+------+
|  rare|
|  Miss|
|Master|
|    Mr|
|   Mrs|
+------+



**Define the function:**

In [30]:
def impute_title(title):
    return mapping[title]
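
Note that `mapping[title]` raises a `KeyError` for any title that is not a key of the dictionary. A more forgiving variant is sketched below. It assumes that unknown titles should also count as "rare", which is a design choice and not something the notebook states. The names `title_mapping` and `impute_title_safe` are hypothetical:

```python
# Shortened version of the notebook's mapping, for illustration only.
title_mapping = {
    "Mr": "Mr", "Mrs": "Mrs", "Miss": "Miss", "Master": "Master",
    "Dr": "rare", "Rev": "rare",
}

def impute_title_safe(title, mapping=title_mapping, default="rare"):
    """Look up a title; fall back to `default` when it is not in the mapping."""
    return mapping.get(title, default)

print(impute_title_safe("Mr"))    # known common title, kept as is
print(impute_title_safe("Dr"))    # known rare title
print(impute_title_safe("Dona"))  # made-up unseen title, uses the default
```

Such a function can be wrapped with `F.udf` in the same way as `impute_title`.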

**Apply the function on "Title" column using UDF:**

In [31]:
replaceUDF = F.udf(lambda z: impute_title(z))
combined = combined.withColumn("Title", replaceUDF(F.col("Title")))
combined.show()

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked| Title|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|    Mr|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|   Mrs|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|  Miss|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|   Mrs|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|    Mr|
|          6|       0|  

**Display "Title" from table and group by "Title" column:**

In [32]:
combined.select("Title").groupBy("Title").count().show()

+------+-----+
| Title|count|
+------+-----+
|  rare|   54|
|  Miss|  364|
|Master|   80|
|    Mr| 1034|
|   Mrs|  250|
+------+-----+



## **Preprocessing Age**

**You will fill in the missing age values with the mean age. First, compute the mean:**

In [33]:
mean_age = combined.select(F.mean(combined['Age'])).collect()[0][0]
mean_age = round(mean_age,2)
print(mean_age)

29.7


**Fill missing age with age mean:**

In [34]:
combined = combined.na.fill(mean_age, subset=['Age'])
combined.select("Age").show(10)

+----+
| Age|
+----+
|22.0|
|38.0|
|26.0|
|35.0|
|35.0|
|29.7|
|54.0|
| 2.0|
|27.0|
|14.0|
+----+
only showing top 10 rows
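
The same two steps (compute the mean over the known values, then substitute it for the missing ones) can be sketched in plain Python. The ages below are made up, `None` marks a missing value, and `fill_with_mean` is a hypothetical helper:

```python
def fill_with_mean(values, ndigits=2):
    """Replace None entries with the rounded mean of the non-missing entries."""
    known = [v for v in values if v is not None]
    mean = round(sum(known) / len(known), ndigits)
    return [mean if v is None else v for v in values]

# Made-up ages; NOT taken from the dataset.
ages = [22.0, 38.0, None, 35.0, None]
filled = fill_with_mean(ages)
print(filled)
```

As in the Spark version, the missing entries are left out when the mean is computed, so they do not pull the average toward zero.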



## **Preprocessing Embarked**

**Select Embarked, count them, order by count Desc, and save in grouped_Embarked variable:**




In [35]:
grouped_Embarked = combined.select("Embarked").groupBy("Embarked").count().orderBy(F.desc("count"))

**Show grouped_Embarked:**

In [36]:
grouped_Embarked.show()

+--------+-----+
|Embarked|count|
+--------+-----+
|       S| 1288|
|       C|  336|
|       Q|  154|
|    null|    4|
+--------+-----+



**Fill the missing values with 'S', the most frequent value in grouped_Embarked:**

In [37]:
combined = combined.na.fill('S', subset=['Embarked'])
grouped_Embarked = combined.select("Embarked").groupBy("Embarked").count().orderBy(F.desc("count"))
grouped_Embarked.show()

+--------+-----+
|Embarked|count|
+--------+-----+
|       S| 1292|
|       C|  336|
|       Q|  154|
+--------+-----+



In [38]:
combined.show(5)

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+-----+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|Title|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+-----+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|   Mr|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|  Mrs|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S| Miss|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|  Mrs|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|   Mr|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+-----+

## **Preprocessing Cabin**

**Replace the "Cabin" column with the first character of its string:**

In [39]:
# split and col live in pyspark.sql.functions, imported above as F
combined = combined.withColumn('Cabin', F.split(F.col('Cabin'), "").getItem(0))

**Show the result:**

In [40]:
combined.show()

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked| Title|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|    Mr|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|    C|       C|   Mrs|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|  Miss|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1|    C|       S|   Mrs|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|    Mr|
|          6|       0|  

**Create the temporary view:**

In [41]:
combined.createOrReplaceTempView('combined')

**Select the "Cabin" column, group by it, count each group, and order by count descending:**

In [43]:
combined.select("Cabin").groupBy("Cabin").count().orderBy(F.desc("count")).show()

+-----+-----+
|Cabin|count|
+-----+-----+
| null| 1374|
|    C|  118|
|    B|   94|
|    D|   66|
|    E|   64|
|    A|   30|
|    F|   26|
|    G|    8|
|    T|    2|
+-----+-----+



**Fill missing values with "U":**

In [44]:
combined = combined.na.fill('U', subset=['Cabin'])

**StringIndexer: A label indexer that maps a string column of labels to an ML column of label indices. If the input column is numeric, we cast it to string and index the string values. The indices are in [0, numLabels). By default, this is ordered by label frequencies so the most frequent label gets index 0. The ordering behavior is controlled by setting stringOrderType. Its default value is ‘frequencyDesc’.**

**StringIndexer(inputCol=None, outputCol=None)**
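
To make the `frequencyDesc` ordering concrete, here is a pure-Python sketch that mimics the index assignment. It does not call Spark. The alphabetical tie-break is an assumption of this sketch, and `frequency_desc_index` is a hypothetical name:

```python
from collections import Counter

def frequency_desc_index(labels):
    """Map each distinct label to an index, most frequent label first.

    Ties are broken alphabetically here (an assumption of this sketch).
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Embarked counts taken from the grouped output earlier in the notebook.
embarked = ["S"] * 1292 + ["C"] * 336 + ["Q"] * 154
index = frequency_desc_index(embarked)
print(index)
```

The most frequent port, S, gets index 0, followed by C and Q.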

**Pipeline: ML Pipelines provide a uniform set of high-level APIs built on top of DataFrames that help users create and tune practical machine learning pipelines.**

____________________________________________

**Use Pipeline to fit and transform:**

In [46]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier

**VectorAssembler: VectorAssembler(*, inputCols=None, outputCol=None) A feature transformer that merges multiple columns into a vector column.**



**Use the randomSplit function to split the data into X_train and X_test, with 80% and 20% of the rows respectively:**

In [47]:
X_train, X_test = combined.randomSplit([0.8,0.2],seed=0)
print(f"There are {X_train.count()} rows in the training set, and {X_test.count()} in the test set.")

There are 1435 rows in the training set, and 347 in the test set.


In [48]:
# Collect the names of all string-typed columns
categoricalColumns = [name for (name, dtype) in X_train.dtypes
                      if dtype == "string"]
categoricalColumns

['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked', 'Title']

In [49]:
categoricalColumns.remove("Name")
categoricalColumns

['Sex', 'Ticket', 'Cabin', 'Embarked', 'Title']

In [50]:
indexOutputColumns = [x + "_Index" for x in categoricalColumns]
indexOutputColumns

['Sex_Index', 'Ticket_Index', 'Cabin_Index', 'Embarked_Index', 'Title_Index']

In [51]:
oheOutputColumns = [x + "_OHE" for x in categoricalColumns]
oheOutputColumns

['Sex_OHE', 'Ticket_OHE', 'Cabin_OHE', 'Embarked_OHE', 'Title_OHE']

In [52]:
stringIndexer = StringIndexer(inputCols=categoricalColumns,
                             outputCols=indexOutputColumns,
                             handleInvalid='skip')
oheEncoder = OneHotEncoder(inputCols=indexOutputColumns,
                          outputCols=oheOutputColumns)

# Numeric feature columns: every double/int column except the label and the ID
numericColumns = [field for (field, dataType) in X_train.dtypes
                  if dataType in ('double', 'int') and field != 'Survived']
numericColumns.remove('PassengerId')
numericColumns

['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
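
The `OneHotEncoder` stage turns each category index into a 0/1 vector. The sketch below mimics that in plain Python under the encoder's default `dropLast=True` setting, where the last category is encoded as all zeros. It uses dense lists, whereas Spark produces sparse vectors, and `one_hot` is a hypothetical helper:

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a list of 0.0/1.0 values.

    With drop_last=True the vector has num_categories - 1 slots and
    the last category is encoded as all zeros.
    """
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector

# Three categories, e.g. the Embarked indices 0, 1 and 2.
print(one_hot(0, 3))
print(one_hot(2, 3))
print(one_hot(2, 3, drop_last=False))
```

Dropping the last slot loses no information, because the all-zeros vector still identifies the last category.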

In [53]:
assemblerInputs = oheOutputColumns + numericColumns
assemblerInputs

['Sex_OHE',
 'Ticket_OHE',
 'Cabin_OHE',
 'Embarked_OHE',
 'Title_OHE',
 'Pclass',
 'Age',
 'SibSp',
 'Parch',
 'Fare']

In [54]:
from pyspark.ml.feature import VectorAssembler

vecAssembler = VectorAssembler(inputCols=assemblerInputs,outputCol='features')

**Build a RandomForestClassifier model, use a pipeline to fit and transform, then display the "features", "Survived", and "prediction" columns:**

In [55]:
rfc = RandomForestClassifier(featuresCol='features', labelCol='Survived', predictionCol='prediction', maxDepth=5)
pipeline =Pipeline(stages = [stringIndexer,oheEncoder,vecAssembler,rfc])

In [56]:
pipelineModel = pipeline.fit(X_train)

In [57]:
predDF = pipelineModel.transform(X_test)

In [58]:
predDF.select('features','Survived','prediction').show(5)

+--------------------+--------+----------+
|            features|Survived|prediction|
+--------------------+--------+----------+
|(686,[0,82,667,67...|       0|       0.0|
|(686,[0,528,670,6...|       1|       0.0|
|(686,[558,667,678...|       1|       0.0|
|(686,[0,642,667,6...|       0|       0.0|
|(686,[0,537,667,6...|       1|       0.0|
+--------------------+--------+----------+
only showing top 5 rows



**Use MulticlassClassificationEvaluator and set the "labelCol" to "Survived",  "predictionCol" to "prediction", "metricName" to "accuracy"** 

In [59]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
MCE = MulticlassClassificationEvaluator(predictionCol='prediction',
                                        labelCol='Survived',
                                        metricName='accuracy')

In [60]:
accuracy = MCE.evaluate(predDF)
accuracy

0.7586206896551724
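
The accuracy metric is the fraction of rows where the prediction equals the label. As a small worked example, the sketch below applies that definition in plain Python to the five rows displayed by `predDF.show(5)` above. The helper `accuracy_score` is hypothetical:

```python
def accuracy_score(labels, predictions):
    """Fraction of positions where the prediction equals the label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

# The five rows shown by predDF.show(5) above.
labels = [0, 1, 1, 0, 1]
predictions = [0.0, 0.0, 0.0, 0.0, 0.0]
sample_accuracy = accuracy_score(labels, predictions)
print(sample_accuracy)
```

On these five rows only two predictions match, which is lower than the accuracy measured on the full test split.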

**When you are finished, send the project via Google Classroom.**

**Please let me know if you have any questions.**
* nabieh.mostafa@yahoo.com
* +201015197566 (Whatsapp)

**Don't hate me; I push you to learn.**

**I will help you to become an awesome data engineer.**

**Why did I say "Data Engineer"?**

**It's a tricky but optional question. If you would like to know the answer, ask me.**
