### OCI Data Science - Useful Tips
<details>
<summary><font size="2">Check for Public Internet Access</font></summary>

```python
import requests
# A timeout keeps the check from hanging when there is no route out
response = requests.get("https://oracle.com", timeout=10)
assert response.status_code == 200, "Internet connection failed"
```
</details>
<details>
<summary><font size="2">Helpful Documentation</font></summary>
<ul><li><a href="https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm">Data Science Service Documentation</a></li>
<li><a href="https://docs.cloud.oracle.com/iaas/tools/ads-sdk/latest/index.html">ADS documentation</a></li>
</ul>
</details>
<details>
<summary><font size="2">Typical Cell Imports and Settings for ADS</font></summary>

```python
%load_ext autoreload
%autoreload 2
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

import logging
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)

import ads
from ads.dataset.factory import DatasetFactory
from ads.automl.provider import OracleAutoMLProvider
from ads.automl.driver import AutoML
from ads.evaluations.evaluator import ADSEvaluator
from ads.common.data import ADSData
from ads.explanations.explainer import ADSExplainer
from ads.explanations.mlx_global_explainer import MLXGlobalExplainer
from ads.explanations.mlx_local_explainer import MLXLocalExplainer
from ads.catalog.model import ModelCatalog
from ads.common.model_artifact import ModelArtifact
```
</details>
<details>
<summary><font size="2">Useful Environment Variables</font></summary>

```python
import os
print(os.environ["NB_SESSION_COMPARTMENT_OCID"])
print(os.environ["PROJECT_OCID"])
print(os.environ["USER_OCID"])
print(os.environ["TENANCY_OCID"])
print(os.environ["NB_REGION"])
```
</details>
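
The lookups above raise `KeyError` when a variable is missing, for example when the notebook runs outside an OCI Data Science session. A minimal sketch of a more forgiving read is shown below; `read_oci_env` and `OCI_ENV_VARS` are hypothetical names introduced for illustration, not part of any OCI library.

```python
import os

# Variable names taken from the snippet above; they are normally only set
# inside an OCI Data Science notebook session.
OCI_ENV_VARS = [
    "NB_SESSION_COMPARTMENT_OCID",
    "PROJECT_OCID",
    "USER_OCID",
    "TENANCY_OCID",
    "NB_REGION",
]


def read_oci_env(names=OCI_ENV_VARS, environ=None):
    """Return {name: value or None} without raising KeyError (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return {name: env.get(name) for name in names}


for name, value in read_oci_env().items():
    print(f"{name}: {value if value is not None else '<not set>'}")
```

Passing a plain dictionary as `environ` makes the helper easy to try outside a notebook session.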

In [None]:
!wget https://users.itk.ppke.hu/~pasda2/Books_rating.csv.zip
!unzip Books_rating.csv.zip

--2024-05-21 06:16:14--  https://users.itk.ppke.hu/~pasda2/Books_rating.csv.zip
Resolving users.itk.ppke.hu (users.itk.ppke.hu)... 193.225.109.33
Connecting to users.itk.ppke.hu (users.itk.ppke.hu)|193.225.109.33|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1079521208 (1.0G) [application/zip]
Saving to: ‘Books_rating.csv.zip’


2024-05-21 06:16:30 (66.4 MB/s) - ‘Books_rating.csv.zip’ saved [1079521208/1079521208]

Archive:  Books_rating.csv.zip
  inflating: Books_rating.csv        


In [None]:
!pip install pyspark
!pip install sparkmeasure


Collecting sparkmeasure
  Downloading sparkmeasure-0.24.0-py2.py3-none-any.whl (5.8 kB)
Installing collected packages: sparkmeasure
Successfully installed sparkmeasure-0.24.0


In [None]:
import time

from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.types import FloatType

conf = SparkConf().setAppName("Clustering test")
spark = SparkSession \
    .builder \
    .config(conf=conf)\
    .getOrCreate()

# Load the reviews; inferSchema makes Spark scan the file to guess column types
df=spark.read.csv("Books_rating.csv", header=True, inferSchema=True)
print((df.count(), len(df.columns)))

st=time.time()  # start of the preprocessing timer

# Names of all columns, in file order
columns = df.columns
print(columns)

# Write a custom function to convert the data type of DataFrame columns
def convertColumn(df, names, newType):
    for name in names:
        df = df.withColumn(name, df[name].cast(newType))
    return df


# Convert the `df` columns to `FloatType()`; values that cannot be parsed as numbers become null
df = convertColumn(df, columns, FloatType())

df.printSchema()
print(time.time()-st)

# fillna returns a new DataFrame, so the result must be assigned back
df=df.fillna(value=0)
df.show()
import numpy as np
from pyspark.sql import functions as F
def convertColumnNeg(df, names):
    for name in names:
        df = df.withColumn(name,F.when(df[name]<0,0).otherwise(F.col(name)))
    return df

df=convertColumnNeg(df,columns)
df=df.replace([np.inf, -np.inf], 0)
print(time.time()-st)


from pyspark.ml.feature import VectorAssembler

# Use every column except the score as a feature; copy the list so `columns` is not mutated
incols=list(columns)
print(incols)
incols.remove("review/score")
assembler = VectorAssembler(inputCols=incols,outputCol="features")
df = assembler.transform(df)
final_data = df.select("features", F.col("review/score").alias("score"))

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator


# Trains a k-means model.
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(final_data)

# Make predictions
predictions = model.transform(final_data)

# Evaluate clustering by computing Silhouette score
evaluator = ClusteringEvaluator()

silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

# Shows the result.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/05/21 06:26:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
                                                                                

(3000000, 10)
root
 |-- Id: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- Price: string (nullable = true)
 |-- User_id: string (nullable = true)
 |-- profileName: string (nullable = true)
 |-- review/helpfulness: string (nullable = true)
 |-- review/score: string (nullable = true)
 |-- review/time: string (nullable = true)
 |-- review/summary: string (nullable = true)
 |-- review/text: string (nullable = true)

+----------+--------------------+-----+--------------+--------------------+------------------+------------+-----------+--------------------+--------------------+
|        Id|               Title|Price|       User_id|         profileName|review/helpfulness|review/score|review/time|      review/summary|         review/text|
+----------+--------------------+-----+--------------+--------------------+------------------+------------+-----------+--------------------+--------------------+
|1882931173|Its Only Art If I...| null| AVCGYZL8FQQTD|"Jim of Oz ""jim-...|   

                                                                                

+-----------+-----+-----+-------+-----------+------------------+------------+-----------+--------------+-----------+
|         Id|Title|Price|User_id|profileName|review/helpfulness|review/score|review/time|review/summary|review/text|
+-----------+-----+-----+-------+-----------+------------------+------------+-----------+--------------+-----------+
|1.8829312E9|  0.0|  0.0|    0.0|        0.0|                 0|         4.0| 9.406368E8|           0.0|        0.0|
|8.2641434E8|  0.0|  0.0|    0.0|        0.0|                 0|         5.0|1.0957248E9|           0.0|        0.0|
|8.2641434E8|  0.0|  0.0|    0.0|        0.0|                 0|         5.0|1.0787904E9|           0.0|        0.0|
|8.2641434E8|  0.0|  0.0|    0.0|        0.0|                 0|         4.0|1.0907136E9|           0.0|        0.0|
|8.2641434E8|  0.0|  0.0|    0.0|        0.0|                 0|         4.0|1.1079936E9|           0.0|        0.0|
|8.2641434E8|  0.0|  0.0|    0.0|        0.0|                 0|

Silhouette with squared euclidean distance = 0.990858110432061
Cluster Centers: 
[3.79041007e+08 0.00000000e+00 3.22198067e+00 0.00000000e+00
 2.93680980e-01 0.00000000e+00 1.12063589e+09 4.41401863e+06
 1.26478252e+06]
[9.48066808e+09 1.79288507e+03 9.33857889e-01 0.00000000e+00
 0.00000000e+00 0.00000000e+00 1.09497382e+09 4.85065724e+06
 1.83772017e+06]
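
The silhouette value printed above summarizes how much closer each point is to its own cluster than to the nearest other cluster. The block below is a minimal plain-Python sketch of the textbook definition using squared Euclidean distance, assuming at least two clusters. The function names are hypothetical, and Spark's `ClusteringEvaluator` computes the metric in a distributed way that may differ in detail. Because the distance is squared and unscaled, columns with large magnitudes (such as `Id` and `review/time` in the centers above) contribute most to it.

```python
def squared_euclidean(p, q):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def silhouette_score(points, labels):
    """Mean silhouette over all points (illustrative helper, needs >= 2 clusters)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point itself contributes 0 to the sum)
        a = sum(squared_euclidean(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(squared_euclidean(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


# Two tight, well-separated groups score close to 1
toy_points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(silhouette_score(toy_points, [0, 0, 1, 1]))
```

A value near 1 means the clusters are far apart relative to their spread; it does not by itself say that the grouping is meaningful for the data.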


                                                                                

In [None]:
import time

from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.types import FloatType

conf = SparkConf().setAppName("Clustering test")
spark = SparkSession \
    .builder \
    .config(conf=conf)\
    .getOrCreate()

# Load the reviews; inferSchema makes Spark scan the file to guess column types
df=spark.read.csv("Books_rating.csv", header=True, inferSchema=True)
print((df.count(), len(df.columns)))

st=time.time()  # start of the preprocessing timer

# Names of all columns, in file order
columns = df.columns
print(columns)

# Write a custom function to convert the data type of DataFrame columns
def convertColumn(df, names, newType):
    for name in names:
        df = df.withColumn(name, df[name].cast(newType))
    return df


# Convert the `df` columns to `FloatType()`; values that cannot be parsed as numbers become null
df = convertColumn(df, columns, FloatType())

df.printSchema()
print(time.time()-st)

# fillna returns a new DataFrame, so the result must be assigned back
df=df.fillna(value=0)
df.show()
import numpy as np
from pyspark.sql import functions as F
def convertColumnNeg(df, names):
    for name in names:
        df = df.withColumn(name,F.when(df[name]<0,0).otherwise(F.col(name)))
    return df

df=convertColumnNeg(df,columns)
df=df.replace([np.inf, -np.inf], 0)
print(time.time()-st)


from pyspark.ml.feature import VectorAssembler

# Use every column except the score as a feature; copy the list so `columns` is not mutated
incols=list(columns)
print(incols)
incols.remove("review/score")
assembler = VectorAssembler(inputCols=incols,outputCol="features")
df = assembler.transform(df)
final_data = df.select("features", F.col("review/score").alias("score"))

from pyspark.ml.clustering import LDA


# Trains an LDA model.
lda = LDA(k=10, maxIter=10)
model = lda.fit(final_data)

ll = model.logLikelihood(final_data)
lp = model.logPerplexity(final_data)
print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
print("The upper bound on perplexity: " + str(lp))

# Describe topics.
topics = model.describeTopics(3)
print("The topics described by their top-weighted terms:")
topics.show(truncate=False)

# Shows the result
transformed = model.transform(final_data)
transformed.show(truncate=False)

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/05/21 06:29:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
                                                                                

(3000000, 10)
root
 |-- Id: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- Price: string (nullable = true)
 |-- User_id: string (nullable = true)
 |-- profileName: string (nullable = true)
 |-- review/helpfulness: string (nullable = true)
 |-- review/score: string (nullable = true)
 |-- review/time: string (nullable = true)
 |-- review/summary: string (nullable = true)
 |-- review/text: string (nullable = true)

+----------+--------------------+-----+--------------+--------------------+------------------+------------+-----------+--------------------+--------------------+
|        Id|               Title|Price|       User_id|         profileName|review/helpfulness|review/score|review/time|      review/summary|         review/text|
+----------+--------------------+-----+--------------+--------------------+------------------+------------+-----------+--------------------+--------------------+
|1882931173|Its Only Art If I...| null| AVCGYZL8FQQTD|"Jim of Oz ""jim-...|   

                                                                                

+-----------+-----+-----+-------+-----------+------------------+------------+-----------+--------------+-----------+
|         Id|Title|Price|User_id|profileName|review/helpfulness|review/score|review/time|review/summary|review/text|
+-----------+-----+-----+-------+-----------+------------------+------------+-----------+--------------+-----------+
|1.8829312E9|  0.0|  0.0|    0.0|        0.0|                 0|         4.0| 9.406368E8|           0.0|        0.0|
|8.2641434E8|  0.0|  0.0|    0.0|        0.0|                 0|         5.0|1.0957248E9|           0.0|        0.0|
|8.2641434E8|  0.0|  0.0|    0.0|        0.0|                 0|         5.0|1.0787904E9|           0.0|        0.0|
|8.2641434E8|  0.0|  0.0|    0.0|        0.0|                 0|         4.0|1.0907136E9|           0.0|        0.0|
|8.2641434E8|  0.0|  0.0|    0.0|        0.0|                 0|         4.0|1.1079936E9|           0.0|        0.0|
|8.2641434E8|  0.0|  0.0|    0.0|        0.0|                 0|

24/05/21 06:30:00 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
24/05/21 06:30:00 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS


The lower bound on the log likelihood of the entire corpus: -34743482477344.62
The upper bound on perplexity: 0.36842704359015677
The topics described by their top-weighted terms:
+-----+-----------+-------------------------------------------------------------------+
|topic|termIndices|termWeights                                                        |
+-----+-----------+-------------------------------------------------------------------+
|0    |[0, 6, 8]  |[0.8945468759011631, 0.10534636157243132, 1.0666821443806964E-4]   |
|1    |[2, 6, 5]  |[0.9333602771802407, 0.06413065438129603, 3.9510862332303525E-4]   |
|2    |[2, 7, 6]  |[0.652047032026018, 0.07732664572683579, 0.04971493008684246]      |
|3    |[4, 6, 3]  |[0.9264605176866373, 0.06011140914728876, 0.0021902430403172913]   |
|4    |[6, 0, 2]  |[0.99103738390442, 0.00896261576278761, 3.3243271799326896E-10]    |
|5    |[7, 0, 4]  |[0.9799853399009457, 0.02001458279803936, 7.038819638424458E-8]    |
|6    |[0, 6, 7]  |[0.576123

+-----------+-----+-----+-------+-----------+------------------+------------+-----------+--------------+-----------+---------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Id         |Title|Price|User_id|profileName|review/helpfulness|review/score|review/time|review/summary|review/text|features                                                 |topicDistribution                                                                                                                                                                                                             |
+-----------+-----+-----+-------+-----------+------------------+------------+-----------+--------------+-----------+---------------------------------------------------------+------------------------------

                                                                                

In [None]:
import time

from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.types import FloatType

conf = SparkConf().setAppName("Clustering test")
spark = SparkSession \
    .builder \
    .config(conf=conf)\
    .getOrCreate()

# Load the reviews; inferSchema makes Spark scan the file to guess column types
df=spark.read.csv("Books_rating.csv", header=True, inferSchema=True)
print((df.count(), len(df.columns)))

st=time.time()  # start of the preprocessing timer

# Names of all columns, in file order
columns = df.columns
print(columns)

# Write a custom function to convert the data type of DataFrame columns
def convertColumn(df, names, newType):
    for name in names:
        df = df.withColumn(name, df[name].cast(newType))
    return df


# Convert the `df` columns to `FloatType()`; values that cannot be parsed as numbers become null
df = convertColumn(df, columns, FloatType())

df.printSchema()
print(time.time()-st)

# fillna returns a new DataFrame, so the result must be assigned back
df=df.fillna(value=0)
df.show()
import numpy as np
from pyspark.sql import functions as F
def convertColumnNeg(df, names):
    for name in names:
        df = df.withColumn(name,F.when(df[name]<0,0).otherwise(F.col(name)))
    return df

df=convertColumnNeg(df,columns)
df=df.replace([np.inf, -np.inf], 0)
print(time.time()-st)


from pyspark.ml.feature import VectorAssembler

# Use every column except the score as a feature; copy the list so `columns` is not mutated
incols=list(columns)
print(incols)
incols.remove("review/score")
assembler = VectorAssembler(inputCols=incols,outputCol="features")
df = assembler.transform(df)
final_data = df.select("features", F.col("review/score").alias("score"))

from pyspark.ml.clustering import BisectingKMeans
from pyspark.ml.evaluation import ClusteringEvaluator


# Trains a bisecting k-means model.
bkm = BisectingKMeans().setK(2).setSeed(1)
model = bkm.fit(final_data)

# Make predictions
predictions = model.transform(final_data)

# Evaluate clustering by computing Silhouette score
evaluator = ClusteringEvaluator()

silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

# Shows the result.
print("Cluster Centers: ")
centers = model.clusterCenters()
for center in centers:
    print(center)

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/05/21 07:16:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
                                                                                

(3000000, 10)
['Id', 'Title', 'Price', 'User_id', 'profileName', 'review/helpfulness', 'review/score', 'review/time', 'review/summary', 'review/text']
root
 |-- Id: float (nullable = true)
 |-- Title: float (nullable = true)
 |-- Price: float (nullable = true)
 |-- User_id: float (nullable = true)
 |-- profileName: float (nullable = true)
 |-- review/helpfulness: integer (nullable = true)
 |-- review/score: float (nullable = true)
 |-- review/time: float (nullable = true)
 |-- review/summary: float (nullable = true)
 |-- review/text: float (nullable = true)

0.2221980094909668
+-----------+-----+-----+-------+-----------+------------------+------------+-----------+--------------+-----------+
|         Id|Title|Price|User_id|profileName|review/helpfulness|review/score|review/time|review/summary|review/text|
+-----------+-----+-----+-------+-----------+------------------+------------+-----------+--------------+-----------+
|1.8829312E9|  0.0|  0.0|    0.0|        0.0|                 0| 



Silhouette with squared euclidean distance = 0.99999966362206
Cluster Centers: 
[4.63098450e+08 2.70117401e+00 3.48817748e+00 1.83539728e-03
 8.10715029e+06 3.17554749e+03 1.12527564e+09 5.19462755e+06
 1.52755909e+06]
[9.70988608e+08 0.00000000e+00 0.00000000e+00 0.00000000e+00
 3.72173244e+14 0.00000000e+00 9.92131200e+08 0.00000000e+00
 0.00000000e+00]
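
The preprocessing in these cells casts every column to a number, fills nulls with 0, clamps negatives to 0, and replaces infinities with 0. The block below is a rough plain-Python sketch of those steps applied to one value at a time, assuming a permissive cast in which unparsable text becomes a missing value. `to_float_or_none` and `clean_value` are hypothetical helpers, and Python floats are 64-bit whereas Spark's `FloatType` is 32-bit, so large values such as the `Id` column print differently.

```python
import math


def to_float_or_none(raw):
    """Permissive cast: unparsable input becomes None (standing in for a null)."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None


def clean_value(raw):
    """Apply the notebook's cleaning steps to one raw value (illustrative only)."""
    value = to_float_or_none(raw)
    if value is None:        # mirrors df.fillna(value=0)
        return 0.0
    if value < 0:            # mirrors convertColumnNeg
        return 0.0
    if math.isinf(value):    # mirrors df.replace([np.inf, -np.inf], 0)
        return 0.0
    return value


# Values resembling the first row shown earlier: a numeric Id, a title, a missing price, a user id
row = ["1882931173", "Its Only Art If I...", None, "AVCGYZL8FQQTD"]
print([clean_value(v) for v in row])  # [1882931173.0, 0.0, 0.0, 0.0]
```

Under this scheme free-text columns such as `Title` and `review/text` mostly collapse to 0, which matches the zero-filled table in the output above.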


                                                                                

In [None]:
import time

from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.types import FloatType

conf = SparkConf().setAppName("Clustering test")
spark = SparkSession \
    .builder \
    .config(conf=conf)\
    .getOrCreate()

# Load the reviews; inferSchema makes Spark scan the file to guess column types
df=spark.read.csv("Books_rating.csv", header=True, inferSchema=True)
print((df.count(), len(df.columns)))

st=time.time()  # start of the preprocessing timer

# Names of all columns, in file order
columns = df.columns
print(columns)

# Write a custom function to convert the data type of DataFrame columns
def convertColumn(df, names, newType):
    for name in names:
        df = df.withColumn(name, df[name].cast(newType))
    return df


# Convert the `df` columns to `FloatType()`; values that cannot be parsed as numbers become null
df = convertColumn(df, columns, FloatType())

print(time.time()-st)

# fillna returns a new DataFrame, so the result must be assigned back
df=df.fillna(value=0)
import numpy as np
from pyspark.sql import functions as F
def convertColumnNeg(df, names):
    for name in names:
        df = df.withColumn(name,F.when(df[name]<0,0).otherwise(F.col(name)))
    return df

df=convertColumnNeg(df,columns)
df=df.replace([np.inf, -np.inf], 0)
print(time.time()-st)


from pyspark.ml.feature import VectorAssembler

# Use every column except the score as a feature; copy the list so `columns` is not mutated
incols=list(columns)
print(incols)
incols.remove("review/score")
assembler = VectorAssembler(inputCols=incols,outputCol="features")
df = assembler.transform(df)
final_data = df.select("features", F.col("review/score").alias("score"))

from pyspark.ml.clustering import GaussianMixture

gmm = GaussianMixture().setK(2).setSeed(538009335)
model = gmm.fit(final_data)

print("Gaussians shown as a DataFrame: ")
model.gaussiansDF.show(truncate=False)

                                                                                

(3000000, 10)
['Id', 'Title', 'Price', 'User_id', 'profileName', 'review/helpfulness', 'review/score', 'review/time', 'review/summary', 'review/text']
0.15661048889160156
0.28365468978881836
['Id', 'Title', 'Price', 'User_id', 'profileName', 'review/helpfulness', 'review/score', 'review/time', 'review/summary', 'review/text']


24/05/21 07:25:03 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
24/05/21 07:25:03 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK
24/05/21 07:25:03 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
24/05/21 07:25:03 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
24/05/21 07:25:03 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
24/05/21 07:25:03 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.ForeignLinkerBLAS
                                                                                

Gaussians shown as a DataFrame: 
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|mean                                                                                                                                                                               |cov                                                                                                                   

