# Mini-project: MapReduce - Car prices

This notebook demonstrates MapReduce analytics on a car sales dataset using Python and MRJob. It was created as part of my Big Data class and serves as my first hands-on project using MapReduce.

## Input file description

We are using the `car_prices.csv` input file. It is a Comma Separated Values (CSV) file that provides information pertaining to the sales transactions of various vehicles. The dataset comprises 558837 rows and 16 columns and occupies around 85 MB.

Each row denotes a car sale. The columns represent:

1. `year`: The manufacturing year of the vehicle.
2. `make`: The brand or manufacturer of the vehicle.
3. `model`: The specific model of the vehicle.
4. `trim`: Additional designation for the vehicle model.
5. `body`: The body type of the vehicle (e.g., SUV, Sedan).
6. `transmission`: The type of transmission in the vehicle (e.g., automatic).
7. `vin`: Vehicle Identification Number, a unique code for each vehicle.
8. `state`: The state where the vehicle is registered.
9. `condition`: Condition of the vehicle, possibly rated on a scale.
10. `odometer`: The mileage or distance traveled by the vehicle.
11. `color`: Exterior color of the vehicle.
12. `interior`: Interior color of the vehicle.
13. `seller`: The entity selling the vehicle.
14. `mmr`: Manheim Market Report, possibly indicating the estimated market value of the vehicle.
15. `sellingprice`: The price at which the vehicle was sold.
16. `saledate`: The date and time when the vehicle was sold.

Dataset source: [https://www.kaggle.com/datasets/syedanwarafridi/vehicle-sales-data](https://www.kaggle.com/datasets/syedanwarafridi/vehicle-sales-data)

Like almost all real-world datasets, this one has several data quality issues. One of them is missing values: dates, brands, prices, colors, and other fields may be absent. Imputing missing values is out of scope for this project. Consequently, in **all** implementations, **we simply ignore every row that has a missing value in columns 1, 2, 10, 11, 12, 15, or 16** (`year`, `make`, `odometer`, `color`, `interior`, `sellingprice`, and `saledate`). Fortunately, only a small portion of the records is lost with this strategy.
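
The filtering rule above could be sketched as a small helper. This is only an illustration, not the code used in the jobs below: the name `parse_row`, the constants, and the sample rows are made up for the example, and it parses each line with Python's `csv` module so that a quoted field containing a comma stays in one piece.

```python
import csv

# 0-based indices of columns 1, 2, 10, 11, 12, 15 and 16 from the list above
REQUIRED = (0, 1, 9, 10, 11, 14, 15)
EXPECTED_COLUMNS = 16


def parse_row(line):
    """Return the row as a list of fields, or None if it should be ignored."""
    fields = next(csv.reader([line]), [])
    if len(fields) < EXPECTED_COLUMNS:
        return None  # malformed or empty line
    if any(not fields[i].strip() for i in REQUIRED):
        return None  # a required column is missing
    return fields


# Hypothetical rows, made up for illustration only
complete = ('2014,Ford,Focus,SE,Sedan,automatic,VIN0000000000001,ca,35,'
            '21000,white,black,"example seller, inc",9000,9500,'
            'Tue Dec 16 2014 12:30:00 GMT-0800 (PST)')
no_color = complete.replace(",white,", ",,")

print(parse_row(complete) is not None)
print(parse_row(no_color))
```

The first call keeps the row, while the second returns `None` because the `color` field is empty.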


## 1: Compute Yearly Statistics

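In this job we compute yearly statistics about car sales. Here, "year" is the year of sale taken from `saledate`, and the age of a car is its year of sale minus its manufacturing year. In particular, we implement a MapReduce task that computes the: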

* number of vehicles per year,
* total value of the sold cars per year,
* average distance travelled by the sold cars per year and
* average age of the sold cars per year.
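
Averages cannot be merged directly across partial results, so a common approach is to have the mapper emit counts and sums and to divide only in the reducer. The sketch below simulates the map, shuffle, and reduce steps in plain Python on a few made-up records; the function names `map_record` and `reduce_year` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

# Hypothetical (sale_year, price, odometer, manufacturing_year) records
records = [
    (2014, 10000.0, 30000.0, 2010),
    (2014, 20000.0, 10000.0, 2012),
    (2015, 15000.0, 50000.0, 2011),
]


def map_record(sale_year, price, distance, made_in):
    # Emit a count and raw values (not averages) so partial results can be added
    return sale_year, (1, price, distance, sale_year - made_in)


def reduce_year(values):
    count = sum(v[0] for v in values)
    total_price = sum(v[1] for v in values)
    avg_distance = sum(v[2] for v in values) / count
    avg_age = sum(v[3] for v in values) / count
    return count, total_price, avg_distance, avg_age


# "Shuffle": group the mapped values by key, as the framework would
groups = defaultdict(list)
for record in records:
    key, value = map_record(*record)
    groups[key].append(value)

stats = {year: reduce_year(values) for year, values in groups.items()}
print(stats)
```

Each value in `stats` has the same layout as the tuple produced by the job: count, total price, average distance, average age.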


In [None]:
%%file task1.py
#!/usr/bin/env python3

from mrjob.job import MRJob

class Task1(MRJob):

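    def mapper(self, _, line):
        columns = line.split(",")

        # Skip malformed rows that do not have all 16 columns
        if len(columns) < 16:
            return

        # Skip rows with missing values in the required columns
        # (0-based indices of columns 1, 2, 10, 11, 12, 15 and 16)
        for i in [0, 1, 9, 10, 11, 14, 15]:
            if not columns[i].strip():
                return

        saledate = columns[15]
        try:
            # The sale year is the fourth whitespace-separated token of saledate
            year = int(saledate.split()[3])
            price = float(columns[14])
            distance = float(columns[9])
            manufactured_on = int(columns[0])
        except (IndexError, ValueError):  # Header row or invalid date/number
            return

        age = year - manufactured_on
        yield year, (1, price, distance, age)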

    def reducer(self, year, values):
        total_count = 0
        total_price = 0
        total_distance = 0
        total_age = 0

        for count, price, distance, age in values:
            total_count += count
            total_price += price
            total_distance += distance
            total_age += age

        yield year, (total_count, total_price, total_distance/total_count, total_age/total_count)
  
if __name__ == '__main__':
    Task1.run()

### Running the code

In [None]:
# We run the job here in both standalone and distributed modes.

!python3 task1.py car_prices.csv 

!python3 task1.py -r hadoop car_prices.csv -o task1_out


## 2: Compute Yearly Statistics per Brand

In this job we are interested in the yearly performance of car sales **per brand**. It computes the same statistics as the previous job, but additionally grouped by brand. More specifically, we are interested in computing the:

* number of vehicles per year, per brand,
* total value of the sold cars per year, per brand,
* average distance travelled by the sold cars per year, per brand and
* average age of the sold cars per year, per brand.
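
Grouping by year and brand amounts to using a composite key. As far as I know, MRJob serializes keys and values as JSON by default, so a tuple key reaches the reducer as a list. The plain-Python sketch below mimics that round trip on made-up mapper output; it is an illustration under that assumption, not the framework's actual shuffle.

```python
import json
from collections import defaultdict

# Hypothetical mapper output: ((sale_year, brand), (count, price, distance, age))
mapped = [
    ((2014, "Ford"), (1, 10000.0, 30000.0, 4)),
    ((2014, "BMW"), (1, 25000.0, 20000.0, 2)),
    ((2014, "Ford"), (1, 12000.0, 50000.0, 6)),
]

groups = defaultdict(list)
for key, value in mapped:
    # Round-trip through JSON to mimic serialization between map and reduce
    decoded_key = json.loads(json.dumps(key))  # tuple -> JSON array -> list
    # Lists are unhashable, so convert back to a tuple to use it as a dict key
    groups[tuple(decoded_key)].append(value)

for (year, brand), values in sorted(groups.items()):
    count = sum(v[0] for v in values)
    total_price = sum(v[1] for v in values)
    print(year, brand, count, total_price)
```

Unpacking with `year, brand = key` works for both tuples and lists, which is why a reducer can unpack the composite key regardless of how it was serialized.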




In [None]:
%%file task2.py
#!/usr/bin/env python3

from mrjob.job import MRJob

class Task2(MRJob):

    def mapper(self, _, line):
        columns = line.split(",")

        # Skip malformed rows that do not have all 16 columns
        if len(columns) < 16:
            return

        # Skip rows with missing values in the required columns
        for i in [0, 1, 9, 10, 11, 14, 15]:
            if not columns[i].strip():
                return

        saledate = columns[15]
        try:
            # The sale year is the fourth whitespace-separated token of saledate
            year = int(saledate.split()[3])
            price = float(columns[14])
            distance = float(columns[9])
            manufactured_on = int(columns[0])
        except (IndexError, ValueError):  # Header row or invalid date/number
            return

        age = year - manufactured_on
        brand = columns[1]
        yield (year, brand), (1, price, distance, age)

    def reducer(self, keys, values):
        total_count = 0
        total_price = 0
        total_distance = 0
        total_age = 0

        for count, price, distance, age in values:
            total_count += count
            total_price += price
            total_distance += distance
            total_age += age

        year, brand = keys

        yield (year, brand), (total_count, total_price, total_distance / total_count, total_age / total_count)
  
if __name__ == '__main__':
    Task2.run()

### Running the code

In [None]:
# We run the job here in both standalone and distributed modes.

!python3 task2.py car_prices.csv 

!python3 task2.py -r hadoop car_prices.csv -o task2_out


## 3: Large-scale analytics: Feature Exploration

In this job we will perform a part of what is called feature exploration. Feature exploration focuses on quantifying the impact of a particular feature on the target variable. While such analyses typically include all features, in this example we are only interested in finding out how the (exterior) color of a car affects its sales. More specifically, we need to compute:

* the number of vehicles per (exterior) color, and
* the total value of the sold cars per color.
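
Once the count and the total value per color are available, the average selling price per color follows from a simple division. The sketch below shows that post-processing step on made-up numbers; the `totals` dictionary is hypothetical and does not come from the dataset.

```python
# Hypothetical reducer output: color -> (number of vehicles, total value)
totals = {
    "black": (4, 60000.0),
    "white": (5, 65000.0),
    "red": (1, 18000.0),
}

# Average selling price per color
average_price = {color: total / count for color, (count, total) in totals.items()}

# Rank colors from the highest to the lowest average price
ranked = sorted(average_price.items(), key=lambda item: item[1], reverse=True)
for color, avg in ranked:
    print(f"{color}: {avg:.2f} over {totals[color][0]} sales")
```

Printing the count next to the average helps to spot colors whose average rests on very few sales.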



In [None]:
%%file task3.py
#!/usr/bin/env python3

from mrjob.job import MRJob

class Task3(MRJob):

    def mapper(self, _, line):
        columns = line.split(",")

        # Skip malformed rows that do not have all 16 columns
        if len(columns) < 16:
            return

        # Skip rows with missing values in the required columns
        for i in [0, 1, 9, 10, 11, 14, 15]:
            if not columns[i].strip():
                return

        try:
            price = float(columns[14])
        except ValueError:  # Non-numeric price, e.g. the header row
            return

        color = columns[10]
        # Trial and error showed numbers and dashes in the color column of
        # some rows. We interpret them as missing values and skip those rows.
        if color.isdigit() or color == "\u2014":
            return
        yield color, (1, price)

    def reducer(self, color, values):
        total_count = 0
        total_price = 0

        for count, price in values:
            total_count += count
            total_price += price

        yield color, (total_count, total_price)
  
if __name__ == '__main__':
    Task3.run()

### Running the code

In [None]:
# We run the job here in both standalone and distributed modes.

!python3 task3.py car_prices.csv 

!python3 task3.py -r hadoop car_prices.csv -o task3_out


## 4: Exploratory Analysis

In this job we will perform a part of what is called exploratory analysis. This process applies unsupervised techniques to a data collection with the aim of discovering potentially useful information. In this example we are interested in finding out which combination of exterior and interior colors sells the most. More specifically, we need to compute:

* the number of vehicles per pair of exterior/interior color.
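
The job only counts vehicles per color pair, so finding the best-selling combination needs one more step over its output. The sketch below assumes the output consists of lines of the form JSON key, tab, JSON value (which, to my knowledge, is what MRJob writes by default); the sample lines and the helper `parse_line` are made up for illustration.

```python
import json

# Hypothetical output lines in the "key<TAB>value" JSON layout
output_lines = [
    '["black", "black"]\t120',
    '["white", "gray"]\t95',
    '["silver", "black"]\t130',
]


def parse_line(line):
    """Split one output line into ((exterior, interior), count)."""
    raw_key, raw_value = line.rstrip("\n").split("\t", 1)
    exterior, interior = json.loads(raw_key)
    return (exterior, interior), json.loads(raw_value)


counts = dict(parse_line(line) for line in output_lines)

# The pair with the highest count is the best-selling combination
best_pair = max(counts, key=counts.get)
print(best_pair, counts[best_pair])
```

For a ranking rather than a single winner, sorting `counts.items()` by the count in descending order would work in the same way.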



In [None]:
%%file task4.py
#!/usr/bin/env python3

from mrjob.job import MRJob

class Task4(MRJob):

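    def mapper(self, _, line):
        columns = line.split(",")

        # Skip malformed rows that do not have all 16 columns
        if len(columns) < 16:
            return

        # Skip rows with missing values in the required columns
        for i in [0, 1, 9, 10, 11, 14, 15]:
            if not columns[i].strip():
                return

        # Skip the header row and any row whose year is not numeric
        if not columns[0].strip().isdigit():
            return

        color_ex = columns[10]
        color_in = columns[11]

        # As in the previous task, numbers and dashes in the color columns
        # are interpreted as missing values and the row is skipped.
        if color_ex.isdigit() or color_in == "\u2014" or color_ex == "\u2014":
            return
        yield (color_ex, color_in), 1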

    def reducer(self, colors, counts):             
        color_ex, color_in = colors
        
        yield (color_ex, color_in), sum(counts)
  
if __name__ == '__main__':
    Task4.run()

### Running the code

In [None]:
# We run the job here in both standalone and distributed modes.

!python3 task4.py car_prices.csv 

!python3 task4.py -r hadoop car_prices.csv -o task4_out
