In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425344 sha256=1288332cef15c09d2bfddec1e70b1eafac1c48ed4beb49b3c3a8a8f53bc25774
  Stored in directory: /root/.cache/pip/wheels/41/4e/10/c2cf2467f71c678cfc8a6b9ac9241e5e44a01940da8fbb17fc
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.0


In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

sc = spark.sparkContext

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
import shutil
# Import the txt file produced by the previous notebook.
# The file was modified to get better results; the changes are discussed in the conclusions.

# Define the path to the txt file in your Google Drive
google_drive_csv_path = "/content/drive/MyDrive/UJI/Big Data/BigData NoteBooks/moviesmarks3"

# Define the path to save the txt file locally in your Colab environment
local_csv_path = "/content/moviesmark"

# Copy the txt file from Google Drive to the local Colab environment
shutil.copy(google_drive_csv_path, local_csv_path)

# Now, you can access the file at local_csv_path and perform any operations you need

'/content/moviesmark'

In [5]:
df_model = spark.read.format('libsvm').load('/content/drive/MyDrive/UJI/Big Data/BigData NoteBooks/moviesmarks')

## Decision tree classifier

In [6]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [31]:
from pyspark.ml.feature import VectorAssembler

feature_cols = ["features"]
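# Merges the input columns into a single vector column.
# Note: the classifiers below read "features" directly, so "features_vector" is not used by them.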
assembler = VectorAssembler(inputCols = feature_cols, outputCol = "features_vector")

# Split the data into training and test sets (20% held out for testing)
(trainingData, testData) = df_model.randomSplit([0.8, 0.2], seed = 37)


In [38]:
# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features")

# Chain assembler and tree in a Pipeline
pipeline = Pipeline(stages=[assembler, dt])

# Train model.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

In [39]:
# Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

# Select (prediction, true label) and compute test accuracy
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Accuracy = %g " % (accuracy))

+----------+-----+--------------------+
|prediction|label|            features|
+----------+-----+--------------------+
|       3.0|  1.0|(68514,[1,2,4,5,7...|
|       3.0|  1.0|(68514,[2,3,4,5,6...|
|       3.0|  1.0|(68514,[2,3,4,5,7...|
|       3.0|  1.0|(68514,[2,3,4,5,7...|
|       3.0|  1.0|(68514,[2,3,4,5,7...|
+----------+-----+--------------------+
only showing top 5 rows

Test Accuracy = 0.386189 


In [58]:
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(maxIter=4, regParam=0.2)

# Chain assembler and logistic regression in a Pipeline
pipeline = Pipeline(stages=[assembler, lr])

# Train model.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

# Select (prediction, true label) and compute test accuracy
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Logistic Regression Accuracy = %g " % (accuracy))

+----------+-----+--------------------+
|prediction|label|            features|
+----------+-----+--------------------+
|       2.0|  1.0|(68514,[1,2,4,5,7...|
|       1.0|  1.0|(68514,[2,3,4,5,6...|
|       2.0|  1.0|(68514,[2,3,4,5,7...|
|       1.0|  1.0|(68514,[2,3,4,5,7...|
|       2.0|  1.0|(68514,[2,3,4,5,7...|
+----------+-----+--------------------+
only showing top 5 rows

Logistic Regression Accuracy = 0.452685 


### Conclusions
First, we comment on the changes made to last week's notebook, which were aimed at improving the predictions. One of the first changes was to filter out words with a low IDF, keeping only those whose IDF is greater than 7. Words with a low IDF are repeated throughout the reviews, which suggests that they carry little information for our prediction: they tend to be connectors, conjunctions or other words with little meaning.
This change was made by applying the following code to the IDF RDD from the previous practice:

```
IDF = IDF.filter(lambda x: x[1] > 7.0)
```
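
To make the effect of this threshold concrete, here is a minimal plain-Python sketch. It assumes the common definition IDF = ln(N / df), which may differ from the variant used in the previous notebook, and the corpus size and document frequencies are invented for illustration. Under that assumption, an IDF above 7 corresponds to a word that appears in fewer than roughly one in 1,100 reviews (e^7 ≈ 1097).

```python
import math


def idf(num_docs, doc_freq):
    """Common IDF definition ln(N / df); an assumption, not necessarily the notebook's formula."""
    return math.log(num_docs / doc_freq)


# Hypothetical corpus size and document frequencies, for illustration only.
num_docs = 50_000
doc_freqs = {"de": 48_000, "pelicula": 20_000, "bodrio": 12}

# (word, idf) pairs, the same shape as the elements of the IDF RDD.
idf_pairs = [(word, idf(num_docs, df)) for word, df in doc_freqs.items()]

# Same predicate as the RDD filter: keep pairs whose IDF exceeds 7.0.
kept_words = [word for word, value in idf_pairs if value > 7.0]
print(kept_words)
```

Only the rare word survives the filter; the two frequent words have an IDF below 1 and are dropped.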

Another change, made with the same goal, was to remove stopwords. This was done using the **nltk** package and the following instructions:


```
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stops = set(stopwords.words('spanish'))
```

These stopwords are then filtered out inside the **countWords** function of the previous notebook.
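
The original **countWords** function is not shown here, so the following is only a sketch of what such a filter might look like. `count_words` is a hypothetical stand-in, the tokenisation is a plain whitespace split, and the small `stops` set replaces `set(stopwords.words('spanish'))` so that the example runs without downloading anything.

```python
from collections import Counter

# Stand-in for set(stopwords.words('spanish')): a tiny hand-picked set of
# common Spanish function words, used here only to keep the sketch self-contained.
stops = {"de", "la", "que", "el", "en", "y", "es", "una"}


def count_words(review, stops):
    """Hypothetical stand-in for countWords: term counts with stopwords removed."""
    tokens = review.lower().split()
    return Counter(token for token in tokens if token not in stops)


counts = count_words("La trama es lenta y el final es flojo", stops)
print(counts)
```

Only the content words (`trama`, `lenta`, `final`, `flojo`) remain in the counts.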

Finally, to carry out the prediction we used two machine learning models: a decision tree and logistic regression. We adjusted some hyperparameters of each to get better results, and the logistic regression model reached the higher accuracy (0.453 versus 0.386 on the test set). This may be due to several factors. Decision trees are non-linear models that can create complex decision boundaries. When predicting film scores from reviews, a decision tree may overfit by building a complex structure that captures noise in the training data, resulting in lower accuracy on unseen data.
Logistic regression is a linear model that is generally simpler and less prone to overfitting. It models the relationship between the input features and the target variable (the film score) in a linear, more interpretable way.

In addition, the quality and quantity of the data play a crucial role. In this case we may not have enough data, and the feature we use (tf*IDF) may be too basic.

On balance, both accuracies are very low, which may be due to the simplicity of these models and to the fact that they only relate the occurrence of words to the scores given. To obtain better results, it would be more appropriate to consider more features of the reviews and of the words used.
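
One way to put such accuracies in context is to compare them with a trivial baseline that always predicts the most frequent training label. The sketch below shows how that baseline accuracy could be computed in plain Python. The labels are made up for illustration and are not taken from the review dataset, and the helper `accuracy` is a hypothetical name, not part of the notebook.

```python
from collections import Counter


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


# Made-up labels for illustration only; not the notebook's data.
train_labels = [3.0, 3.0, 1.0, 2.0, 3.0, 4.0, 2.0, 3.0]
test_labels = [3.0, 1.0, 3.0, 2.0, 3.0]

# Baseline: always predict the most frequent label seen in training.
majority_label = Counter(train_labels).most_common(1)[0][0]
baseline_predictions = [majority_label] * len(test_labels)
baseline_accuracy = accuracy(baseline_predictions, test_labels)
print(majority_label, baseline_accuracy)
```

A model is only informative to the extent that its test accuracy exceeds this kind of baseline on the same split.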


