<a href="https://colab.research.google.com/github/mosesyhc/de300-wn2024-notes/blob/main/lab/DATAENG300_Lab7_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 7 - Map Reduce for logistic regression

**Before you begin**, make a copy of this notebook via `File -> Save a Copy` or `Copy to Drive` above. Rename the copy to include your name.


---

## Lab
This lab applies MapReduce to logistic regression using the Titanic dataset.

A logistic regression has the following log-likelihood function:

$$\displaystyle \ell(\beta) = \sum_{i=1}^n y_i(\beta^\mathsf{T} \mathbf{x_i}) - \sum_{i=1}^n\log (1 + \exp\{\beta^\mathsf{T} \mathbf{x_i}\})$$
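
As a sanity check on the formula, the following is a minimal plain-Python sketch (no Spark) that evaluates the two sums separately and then subtracts them. The function name `log_likelihood`, the column names `x1`, `x2`, `y`, and the two toy rows are illustrative assumptions and are not part of the lab data.

```python
import math

def log_likelihood(rows, beta, response="y"):
    """Evaluate sum_i y_i * eta_i - sum_i log(1 + exp(eta_i)), where eta_i = beta^T x_i."""
    ybetax_sum = 0.0   # first sum in the formula
    logterm_sum = 0.0  # second sum in the formula
    for row in rows:
        # linear predictor beta^T x_i for this row
        eta = sum(beta[name] * row[name] for name in beta)
        ybetax_sum += row[response] * eta
        logterm_sum += math.log1p(math.exp(eta))
    return ybetax_sum - logterm_sum

# hypothetical toy data, for illustration only
toy_rows = [{"x1": 1.0, "x2": 0.0, "y": 1}, {"x1": 0.0, "x2": 2.0, "y": 0}]
toy_beta = {"x1": 0.0, "x2": 0.0}
value = log_likelihood(toy_rows, toy_beta)
```

With every coefficient set to zero, each row contributes $-\log 2$, so `value` equals $-2\log 2$ here. This gives a quick way to check any implementation by hand.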

### Tasks
Use `survived` as the response variable, and the other columns as predictors.
1. Write the two appropriate `map` functions for calculating the log-likelihood function.
    - **Note:** the result $\ell(\beta)$ is just a number.
2. Return the value of the log-likelihood function at `beta = {'pclass': -1.11, 'age': -0.03, 'fare': 0.00, 'sex01': -2.5}` using the functions above.

*Tip:* The linear regression example in class may be useful.  The complete notebook is found here: https://github.com/mosesyhc/de300-wn2024-notes/blob/main/examples/ex-linear-mr-complete.ipynb.

### (Similar) PySpark setup in Colab

In [None]:
!wget -q https://archive.apache.org/dist/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz
!tar xf spark-3.4.0-bin-hadoop3.tgz

In [None]:
!pip install -q findspark
!pip install -q seaborn

In [None]:
# spark setup
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.4.0-bin-hadoop3"

In [None]:
# findspark helps locate the environment variables
import findspark
findspark.init()

### Dataset

In [2]:
import seaborn as sns
titanic = sns.load_dataset('titanic', data_home='dataset/', cache=True)

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

In [None]:
titanic = spark.read.csv('dataset/titanic.csv', header=True, inferSchema=True)

# to focus on mapreduce, we only retain the following columns
titanic = titanic \
          .select(['survived', 'pclass', 'sex', 'age', 'fare']) \
          .withColumn('sex01', (F.col('sex') == 'male').cast(IntegerType())) \
          .drop('sex')
# the data are "cleaned" to obtain complete data: missing ages are replaced by the mean age
age_mean = titanic.groupBy().mean('age').first()[0]
titanic = titanic.na.fill({'age': age_mean})
# view summary of data
titanic.describe().toPandas()

### Template code

In [None]:
predictors = ['pclass', 'age', 'fare', 'sex01']
response = 'survived'

# consider beta as fixed and callable from the maps
beta = {'pclass': -1.11, 'age': -0.03, 'fare': 0.00, 'sex01': -2.5}  

In [None]:
# map
import numpy as np  # provides np.log1p() and np.exp() for the log term

def ybetax_map(row):
  row = row.asDict()
  for i in predictors:
      yield  # yields the appropriate value given a row

def logterm_map(row):
  # we may use numpy functions np.log1p(), np.exp() in the map
  row = row.asDict()
  val = 0
  for i in predictors:
      val += ()  # placeholder: accumulate the appropriate term for predictor i
  return  # returns the appropriate value given a row

In [None]:
# reduce
# calling .reduce() directly may be helpful since the result is a scalar
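
The map-then-reduce pattern that collapses many per-record values into one scalar can be sketched with Python's built-in `map` and `functools.reduce`, without Spark. The names `records` and `value_map` and the toy values are hypothetical and only illustrate the pattern. In PySpark the analogous chain would typically be `rdd.map(f).reduce(lambda a, b: a + b)` on an RDD.

```python
from functools import reduce
from operator import add

# hypothetical per-record data, for illustration only
records = [{"value": 1.5}, {"value": -0.5}, {"value": 2.0}]

def value_map(record):
    # map step: turn one record into one number
    return record["value"]

# reduce step: combine the mapped numbers pairwise into a single scalar
total = reduce(add, map(value_map, records))
```

Because addition is associative and commutative, the order in which partial sums are combined does not matter, which is what allows the reduce step to run across partitions. Note that a generator-style map that yields several values per row would pair with `flatMap` rather than `map` in Spark.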

**Submission:**
You will submit the `.ipynb` notebook file and any supporting information you see fit.

# Generative AI disclosure
In this course, you are generally allowed to use Generative Artificial Intelligence (GAI). Any use of GAI should be accompanied by a disclosure at the end of an assignment explaining (1) what you used GAI for; (2) the specific tool(s) you used; and (3) what prompts you used to get the results.

**Include** any disclosure below.