# MPG Cars

### Introduction:

This exercise uses the Auto MPG dataset from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Auto+MPG).

### Step 1. Import the necessary libraries

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

spark = SparkSession.builder.appName("Auto_MPG").getOrCreate()

### Step 2. Import the datasets [cars1](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars1.csv) and [cars2](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars2.csv)

### Step 3. Assign each to a variable called cars1 and cars2

In [3]:
url1 = "https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars1.csv"
url2 = "https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars2.csv"

from pyspark import SparkFiles

# Fetch each remote file so Spark can read it from its local path
spark.sparkContext.addFile(url1)
spark.sparkContext.addFile(url2)

cars1 = spark.read.csv(SparkFiles.get("cars1.csv"), header=True, inferSchema=True)
cars2 = spark.read.csv(SparkFiles.get("cars2.csv"), header=True, inferSchema=True)


### Step 4. Oops, it seems our first dataset has some unnamed blank columns, fix cars1

In [4]:
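cars1.printSchema()

root
 |-- mpg: double (nullable = true)
 |-- cylinders: integer (nullable = true)
 |-- displacement: integer (nullable = true)
 |-- horsepower: string (nullable = true)
 |-- weight: integer (nullable = true)
 |-- acceleration: double (nullable = true)
 |-- model: integer (nullable = true)
 |-- origin: integer (nullable = true)
 |-- car: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)
 |-- _c13: string (nullable = true)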

In [8]:
# Keep every column up to and including 'car'; the unnamed _c9 to _c13 columns are dropped
cars1 = cars1.select(cars1.columns[:cars1.columns.index("car") + 1])

In [10]:
cars1.head(5)

[Row(mpg=18.0, cylinders=8, displacement=307, horsepower='130', weight=3504, acceleration=12.0, model=70, origin=1, car='chevrolet chevelle malibu'),
 Row(mpg=15.0, cylinders=8, displacement=350, horsepower='165', weight=3693, acceleration=11.5, model=70, origin=1, car='buick skylark 320'),
 Row(mpg=18.0, cylinders=8, displacement=318, horsepower='150', weight=3436, acceleration=11.0, model=70, origin=1, car='plymouth satellite'),
 Row(mpg=16.0, cylinders=8, displacement=304, horsepower='150', weight=3433, acceleration=12.0, model=70, origin=1, car='amc rebel sst'),
 Row(mpg=17.0, cylinders=8, displacement=302, horsepower='140', weight=3449, acceleration=10.5, model=70, origin=1, car='ford torino')]
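
Slicing up to `car` relies on the blank columns sitting at the end of the file. A name-based filter is one alternative. The sketch below assumes the unwanted columns are exactly those Spark auto-named `_c<index>`; `named_columns` is a hypothetical helper, not part of PySpark.

```python
import re

# Pattern for the auto-generated names seen in the schema above (_c9, _c10, ...).
AUTO_NAME = re.compile(r"_c\d+")


def named_columns(columns):
    """Return the column names that do not look auto-generated (hypothetical helper)."""
    return [c for c in columns if not AUTO_NAME.fullmatch(c)]


columns = ["mpg", "cylinders", "car", "_c9", "_c10"]
kept = named_columns(columns)
print(kept)

# With a real DataFrame this could be used as (assumes `cars1` is already loaded):
# cars1 = cars1.select(named_columns(cars1.columns))
```

This keeps working if a blank column ever appears in the middle of the file, at the cost of also dropping any real column that happens to match the pattern.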

### Step 5. What is the number of observations in each dataset?

In [11]:
cars1.count()

198

In [12]:
cars2.count()

200

### Step 6. Join cars1 and cars2 into a single DataFrame called cars

In [14]:
cars = cars1.union(cars2)

In [15]:
cars.count()

398
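
`DataFrame.union` matches columns by position rather than by name, so it is worth confirming that both frames share the same column order before stacking them. A minimal sketch with a hypothetical helper, `check_same_columns`, operating on plain lists of names:

```python
def check_same_columns(left, right):
    """Raise ValueError unless both column lists match in name and order (hypothetical helper)."""
    if list(left) != list(right):
        raise ValueError(f"column mismatch: {list(left)} vs {list(right)}")
    return True


ok = check_same_columns(["mpg", "car"], ["mpg", "car"])

try:
    check_same_columns(["mpg", "car"], ["car", "mpg"])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

# With real DataFrames (assumes cars1 and cars2 are already loaded):
# check_same_columns(cars1.columns, cars2.columns)
# cars = cars1.union(cars2)
```

When the names match but the order differs, `unionByName` aligns the columns by name instead.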

### Step 7. Oops, there is a column missing, called owners. Create a random integer generator from 15,000 to 73,000.

In [20]:
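cars1.printSchema()

root
 |-- mpg: double (nullable = true)
 |-- cylinders: integer (nullable = true)
 |-- displacement: integer (nullable = true)
 |-- horsepower: string (nullable = true)
 |-- weight: integer (nullable = true)
 |-- acceleration: double (nullable = true)
 |-- model: integer (nullable = true)
 |-- origin: integer (nullable = true)
 |-- car: string (nullable = true)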

In [88]:
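import random

# random.randint includes both endpoints, so 73000 is the upper bound.
# The UDF is marked nondeterministic so Spark does not assume repeated calls return the same value.
spark_n = F.udf(lambda: random.randint(15000, 73000), T.IntegerType()).asNondeterministic()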
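
Python's `random.randint(a, b)` includes both endpoints, unlike NumPy's `randint`, whose upper bound is exclusive. The sketch below checks this with a seeded generator. The `draw_owners` helper and the seed are assumptions for illustration only.

```python
import random


def draw_owners(n, low=15000, high=73000, seed=0):
    """Draw n integers from [low, high], both ends inclusive (hypothetical helper)."""
    rng = random.Random(seed)
    return [rng.randint(low, high) for _ in range(n)]


owners = draw_owners(1000)
in_range = all(15000 <= value <= 73000 for value in owners)

# On a tiny range every value, including both endpoints, shows up quickly.
seen = set(draw_owners(200, low=1, high=3))
```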

### Step 8. Add the column owners to cars

In [89]:
cars = cars.withColumn('owners', spark_n())

In [90]:
cars.head(5)

[Row(mpg=18.0, cylinders=8, displacement=307, horsepower='130', weight=3504, acceleration=12.0, model=70, origin=1, car='chevrolet chevelle malibu', owners=66265),
 Row(mpg=15.0, cylinders=8, displacement=350, horsepower='165', weight=3693, acceleration=11.5, model=70, origin=1, car='buick skylark 320', owners=60175),
 Row(mpg=18.0, cylinders=8, displacement=318, horsepower='150', weight=3436, acceleration=11.0, model=70, origin=1, car='plymouth satellite', owners=16667),
 Row(mpg=16.0, cylinders=8, displacement=304, horsepower='150', weight=3433, acceleration=12.0, model=70, origin=1, car='amc rebel sst', owners=20570),
 Row(mpg=17.0, cylinders=8, displacement=302, horsepower='140', weight=3449, acceleration=10.5, model=70, origin=1, car='ford torino', owners=59560)]
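
A Python UDF is evaluated row by row in a Python worker. One alternative, swapped in here rather than taken from the exercise, is to scale Spark's built-in `F.rand()` column to the same range. The arithmetic is sketched below in plain Python. `scale_to_range` is a hypothetical name, and the commented Spark expression is an assumption about how it would translate, not tested output.

```python
import math


def scale_to_range(u, low=15000, high=73000):
    """Map a uniform draw u in [0, 1) to an integer in [low, high] inclusive."""
    span = high - low + 1  # number of distinct integers in the range
    return low + math.floor(u * span)


lowest = scale_to_range(0.0)
middle = scale_to_range(0.5)
highest = scale_to_range(0.999999)

# Assumed PySpark equivalent (F is pyspark.sql.functions, as imported in Step 1):
# cars = cars.withColumn("owners", (F.floor(F.rand() * 58001) + 15000).cast("int"))
```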