<a href="https://colab.research.google.com/github/olalepek/PySpark_CNN_Article_Frequent_Items/blob/main/Articles_CNN_Frequent_Items_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Frequent word recognition in CNN articles and article highlights using the FP-Growth model in PySpark and the Apriori algorithm with pandas**

1. Data source: https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail
2. Analysis of the highlights & article text with different sensitivities
3. Analysis of the highlights using pandas to compare computational efficiency

# Setting up Kaggle 

In [None]:
import timeit
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
import kaggle

OSError: ignored

In [None]:
!pip install -q kaggle



Uploading the `kaggle.json` API key file to access Kaggle

In [None]:

from google.colab import files
files.upload()
# https://www.kaggle.com/general/74235



Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"olalepek","key":"<redacted>"}'}

In [None]:
!mkdir -p ~/.kaggle && cp kaggle.json ~/.kaggle/

In [None]:
!chmod 600 ~/.kaggle/kaggle.json



# CNN Article Market Basket Analysis in PySpark

## Setting up PySpark

In [None]:
# Download Java Virtual Machine (JVM)
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [None]:
# Download Spark
!wget -q https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz


In [None]:
!tar -xf '/content/spark-3.3.2-bin-hadoop3.tgz'

In [None]:
# Set up the environment
import os
os.environ["JAVA_HOME"] = '/usr/lib/jvm/java-8-openjdk-amd64'
os.environ["SPARK_HOME"] = '/content/spark-3.3.2-bin-hadoop3'


In [None]:
# Install library for finding Spark
!pip install -q findspark

In [None]:
# Import the library
import findspark

In [None]:
# Initiate findspark
findspark.init()

In [None]:
# Import SparkSession
from pyspark.sql import SparkSession
# Create a Spark Session
spark = SparkSession.builder.master("local[*]").getOrCreate()
# Check Spark Session Information
spark

In [None]:
# Import a Spark function from library to verify
from pyspark.sql.functions import col

### Loading dataset

In [None]:
# https://www.kaggle.com/code/sercanyesiloz/pyspark-tutorial/notebook

In [None]:

!kaggle datasets download -d gowrishankarp/newspaper-text-summarization-cnn-dailymail

Downloading newspaper-text-summarization-cnn-dailymail.zip to /content
100% 501M/503M [00:21<00:00, 25.5MB/s]
100% 503M/503M [00:21<00:00, 24.4MB/s]


In [None]:
!unzip /content/newspaper-text-summarization-cnn-dailymail.zip -d /content/CNN

Archive:  /content/newspaper-text-summarization-cnn-dailymail.zip
  inflating: /content/CNN/cnn_dailymail/test.csv  
  inflating: /content/CNN/cnn_dailymail/train.csv  
  inflating: /content/CNN/cnn_dailymail/validation.csv  


In [None]:

df = spark \
    .read \
    .format("csv") \
    .option("header", True) \
    .load("/content/CNN/cnn_dailymail")

df.printSchema()

root
 |-- id: string (nullable = true)
 |-- article: string (nullable = true)
 |-- highlights: string (nullable = true)



In [None]:
df.show(5)

+--------------------+--------------------+--------------------+
|                  id|             article|          highlights|
+--------------------+--------------------+--------------------+
|0001d1afc246a7964...|By . Associated P...|Bishop John Folda...|
|He contracted the...|                null|                null|
|Church members in...| Grand Forks and ...|                null|
|0002095e55fcbd3a2...|"(CNN) -- Ralph M...|"" of using his r...|
|          Ralph Mata| an internal affa...| allegedly helped...|
+--------------------+--------------------+--------------------+
only showing top 5 rows
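The output above contains rows whose `id` is a fragment of article text (for example `He contracted the...`), which suggests that records with line breaks inside quoted fields were split into several rows. The sketch below reproduces that effect with Python's standard `csv` module on a made-up sample; the sample text and variable names are hypothetical and not taken from the dataset.

```python
import csv
import io

# Hypothetical sample: the article field of the first record contains a line break
# inside a quoted field, as long article text often does.
raw = (
    "id,article,highlights\n"
    'a1,"First line of the article.\nSecond line, still the same article.",Summary one\n'
    'b2,"Single-line article.",Summary two\n'
)

# Naive approach: treat every physical line as one record.
naive_rows = [line.split(",") for line in raw.splitlines()[1:]]

# Quote-aware approach: the csv module keeps the quoted line break inside the field.
proper_rows = list(csv.reader(io.StringIO(raw)))[1:]

print(len(naive_rows))   # number of physical lines after the header
print(len(proper_rows))  # number of logical records after the header
```

In Spark, the CSV reader's `multiLine` option, together with matching `quote` and `escape` settings, is the usual way to parse such files. Whether it resolves every malformed row in this dataset is an assumption that would need to be checked against the data.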



In [None]:
import gc
gc.collect()

42

Removing the rows with the NULL values

In [None]:
df = df.dropna()

## Data Pre-processing - Highlights

Renaming the `highlights` column to `text`, since the highlights are analysed first

In [None]:
df = df.withColumnRenamed("highlights", "text")

In [None]:
df.show(5)

+--------------------+--------------------+--------------------+
|                  id|             article|                text|
+--------------------+--------------------+--------------------+
|0001d1afc246a7964...|By . Associated P...|Bishop John Folda...|
|0002095e55fcbd3a2...|"(CNN) -- Ralph M...|"" of using his r...|
|          Ralph Mata| an internal affa...| allegedly helped...|
|00027e965c8264c35...|A drunk driver wh...|Craig Eccleston-T...|
|0002c17436637c4fe...|(CNN) -- With a b...|Nina dos Santos s...|
+--------------------+--------------------+--------------------+
only showing top 5 rows



Removing links, punctuation, and numbers. Link removal mainly matters for tweets or other text sourced from social media.

In [None]:
from pyspark.sql.functions import regexp_replace


# Remove links, punctuation (REGEX provided) and numbers
df = df.withColumn('text', regexp_replace(df.text, 'https?://\\S+', ' '))
df = df.withColumn('text', regexp_replace(df.text, '[_#()%&:;,.!?\\-]', ' '))
df = df.withColumn('text', regexp_replace(df.text, '[0-9]', ' '))

# Merge multiple spaces
df = df.withColumn('text', regexp_replace(df.text, ' +', ' '))

df.show(10)

+--------------------+--------------------+--------------------+
|                  id|             article|                text|
+--------------------+--------------------+--------------------+
|0001d1afc246a7964...|By . Associated P...|Bishop John Folda...|
|0002095e55fcbd3a2...|"(CNN) -- Ralph M...|"" of using his r...|
|          Ralph Mata| an internal affa...| allegedly helped...|
|00027e965c8264c35...|A drunk driver wh...|Craig Eccleston T...|
|0002c17436637c4fe...|(CNN) -- With a b...|Nina dos Santos s...|
|0003ad6ef0c37534f...|Fleetwood are the...|Fleetwood top of ...|
|        Peterborough|        Bristol City| Chesterfield and...|
|0004306354494f090...|He's been accused...|Prime Minister an...|
|0005d61497d21ff37...|By . Daily Mail R...|NBA star calls fo...|
|0006021f772fad0aa...|"By . Daily Mail ...| other passengers...|
+--------------------+--------------------+--------------------+
only showing top 10 rows
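The clean-up steps above can be tried on a single string in plain Python. This is a minimal sketch, assuming a link pattern of `https?://\S+` so that a whole URL is removed; the function name `clean_text` and the sample sentence are hypothetical.

```python
import re

def clean_text(text):
    """Apply link, punctuation, digit, and whitespace clean-up to one string."""
    text = re.sub(r"https?://\S+", " ", text)       # whole links (assumed pattern)
    text = re.sub(r"[_#()%&:;,.!?\-]", " ", text)   # same punctuation class as above
    text = re.sub(r"[0-9]", " ", text)              # digits
    text = re.sub(r" +", " ", text)                 # merge repeated spaces
    return text

# Hypothetical sample sentence, not taken from the dataset
sample = "Breaking: 3 storms hit the coast (full story at https://t.co/abc123)!"
print(clean_text(sample))
```

Each replacement inserts a space rather than an empty string, so words separated only by punctuation (such as hyphenated names) stay separate tokens.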



In [None]:
df = df["id","text"]

In [None]:
df.show(10)

+--------------------+--------------------+
|                  id|                text|
+--------------------+--------------------+
|0001d1afc246a7964...|Bishop John Folda...|
|0002095e55fcbd3a2...|"" of using his r...|
|          Ralph Mata| allegedly helped...|
|00027e965c8264c35...|Craig Eccleston T...|
|0002c17436637c4fe...|Nina dos Santos s...|
|0003ad6ef0c37534f...|Fleetwood top of ...|
|        Peterborough| Chesterfield and...|
|0004306354494f090...|Prime Minister an...|
|0005d61497d21ff37...|NBA star calls fo...|
|0006021f772fad0aa...| other passengers...|
+--------------------+--------------------+
only showing top 10 rows



Converting text into tokens and removing any stop words that don't bring meaning to the understanding of the text

In [None]:
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

# Split on non-word characters, then drop English stop words
regexTokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W")
stopremove = StopWordsRemover(inputCol='tokens',outputCol='cleaned')

Creating a pipeline to apply the above methods to our dataframe

In [None]:
from pyspark.ml import Pipeline
data_prep_pipe = Pipeline(stages=[regexTokenizer, stopremove ])
cleaner = data_prep_pipe.fit(df)
df = cleaner.transform(df)
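What the two pipeline stages do to one row can be sketched in plain Python. This is only an approximation: it assumes lowercasing and splitting on non-word characters with empty tokens dropped, and it uses a tiny illustrative stop word list, whereas `StopWordsRemover` ships a much longer default list. The sample sentence and helper names are hypothetical.

```python
import re

# Tiny illustrative stop word list (an assumption, not Spark's default list)
STOP_WORDS = {"of", "his", "as", "a", "to", "the", "in", "for", "and"}

def tokenize(text):
    """Lowercase, split on non-word characters, and drop empty tokens."""
    return [tok for tok in re.split(r"\W", text.lower()) if tok]

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop word list."""
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = tokenize("of using his role as a police officer")
cleaned = remove_stop_words(tokens)
print(tokens)
print(cleaned)
```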

Collecting the garbage to keep the workspace clean and free up memory

In [None]:
import gc
gc.collect()

263

In [None]:
df.show(5)

+--------------------+--------------------+--------------------+--------------------+
|                  id|                text|              tokens|             cleaned|
+--------------------+--------------------+--------------------+--------------------+
|0001d1afc246a7964...|Bishop John Folda...|[bishop, john, fo...|[bishop, john, fo...|
|0002095e55fcbd3a2...|"" of using his r...|[of, using, his, ...|[using, role, pol...|
|          Ralph Mata| allegedly helped...|[allegedly, helpe...|[allegedly, helpe...|
|00027e965c8264c35...|Craig Eccleston T...|[craig, eccleston...|[craig, eccleston...|
|0002c17436637c4fe...|Nina dos Santos s...|[nina, dos, santo...|[nina, dos, santo...|
+--------------------+--------------------+--------------------+--------------------+
only showing top 5 rows



### Calculating the average number of tokens per row

In [None]:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import collect_set, array_distinct

In [None]:
baskets = df.select(array_distinct(df.cleaned)).collect()
baskets = spark.createDataFrame(baskets)

In [None]:
baskets.show(5, False)

+---------------------------------------------------------------------------------------------+
|array_distinct(cleaned)                                                                      |
+---------------------------------------------------------------------------------------------+
|[bishop, john, folda, north, dakota, taking, time, diagnosed]                                |
|[using, role, police, officer, help, drug, trafficking, organization, exchange, money, gifts]|
|[allegedly, helped, group, get, guns]                                                        |
|[craig, eccleston, todd, drunk, least, three, pints, driving, car]                           |
|[nina, dos, santos, says, europe, must, ready, accept, sanctions, hurt, sides]               |
+---------------------------------------------------------------------------------------------+
only showing top 5 rows



In [None]:
from pyspark.sql.functions import size
# Count tokens per basket with the built-in size() instead of a Python UDF
baskets = baskets.withColumn('count', size(col('array_distinct(cleaned)')))


In [None]:
baskets.describe().show()

+-------+-----------------+
|summary|            count|
+-------+-----------------+
|  count|           382518|
|   mean|7.969055051004136|
| stddev|5.965069522183348|
|    min|                0|
|    max|              222|
+-------+-----------------+



## Models for Highlights

In [None]:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import collect_set, array_distinct
from pyspark.ml.fpm import FPGrowth



### Model with min support = 0.05



1. Creating baskets, making sure each row contains only distinct items
2. Setting up the model
3. Showing the first 10 frequent items or itemsets returned by the model
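To make the minimum support values used in this section concrete, the sketch below converts each one into a minimum number of baskets. It assumes the threshold is applied as the ceiling of `minSupport` times the number of baskets, which is how Spark's FP-Growth implementation is understood to derive its minimum count; the helper name `min_count` is hypothetical.

```python
import math

def min_count(min_support, n_baskets):
    """Smallest number of baskets an itemset must appear in (assumed ceil rule)."""
    return math.ceil(min_support * n_baskets)

n_baskets = 382518  # row count reported by baskets.describe() earlier

for min_support in (0.05, 0.005, 0.0005):
    print(min_support, min_count(min_support, n_baskets))
```

Under that assumption, an item has to appear in roughly 19 thousand highlights to pass the 0.05 threshold, which is why so few itemsets survive at that level.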


In [None]:
start = timeit.default_timer()
fpGrowth = FPGrowth(minSupport=0.05, minConfidence=1, itemsCol="array_distinct(cleaned)")
model = fpGrowth.fit(baskets)
model.freqItemsets.show(10, False)
highlights_05 = timeit.default_timer() - start

+------+-----+
|items |freq |
+------+-----+
|[said]|29765|
+------+-----+



In [None]:
highlights_05

7.6207929269994565

### Model with min support = 0.005

In [None]:
start = timeit.default_timer()
fpGrowth = FPGrowth(minSupport=0.005, minConfidence=1, itemsCol="array_distinct(cleaned)")
model = fpGrowth.fit(baskets)
model.freqItemsets.show(10, False)
highlights_005 = timeit.default_timer() - start

+----------+----+
|items     |freq|
+----------+----+
|[today]   |2720|
|[minister]|1978|
|[news]    |2335|
|[go]      |2540|
|[park]    |2158|
|[car]     |3067|
|[family]  |4126|
|[sunday]  |3578|
|[told]    |8415|
|[west]    |2879|
+----------+----+
only showing top 10 rows



In [None]:
highlights_005

10.809896277000007

In [None]:
model.associationRules.show(10)

+----------+----------+----------+----+-------+
|antecedent|consequent|confidence|lift|support|
+----------+----------+----------+----+-------+
+----------+----------+----------+----+-------+
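The empty table is consistent with `minConfidence=1`, which keeps only rules that hold in every basket containing the antecedent. The sketch below computes support and confidence on a few made-up baskets to show how strict that setting is; the baskets and function names are hypothetical.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Hypothetical baskets, not taken from the dataset
baskets_demo = [
    {"premier", "league", "win"},
    {"premier", "league"},
    {"league", "cup"},
    {"police", "said"},
]

conf_forward = confidence({"premier"}, {"league"}, baskets_demo)
conf_backward = confidence({"league"}, {"premier"}, baskets_demo)
print(conf_forward, conf_backward)
```

Here only the rule from `premier` to `league` would pass a confidence threshold of 1, because one basket contains `league` without `premier`.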



### Model with min support = 0.0005

In [None]:
start = timeit.default_timer()
fpGrowth = FPGrowth(minSupport=0.0005, minConfidence=1, itemsCol="array_distinct(cleaned)")
model2 = fpGrowth.fit(baskets)
model2.freqItemsets.show(10, False)
highlights_0005 = timeit.default_timer() - start

+-------------+----+
|items        |freq|
+-------------+----+
|[announce]   |246 |
|[insurance]  |290 |
|[singer]     |639 |
|[grand]      |896 |
|[trade]      |331 |
|[wounds]     |216 |
|[today]      |2720|
|[today, said]|295 |
|[isn]        |383 |
|[defender]   |498 |
+-------------+----+
only showing top 10 rows



In [None]:
highlights_0005

109.58784816499997

In [None]:
import gc
gc.collect()

45

In [None]:
model2.associationRules.show(10)

Py4JJavaError: ignored

----------------------------------------
Exception occurred during processing of request from ('127.0.0.1', 45168)
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/content/spark-3.3.2-bin-hadoop3/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 516, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/content/spark-3.3.2-bin-hadoop3/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/content/spark-3.3.2-bin-hadoop3/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 539, in send_command
    raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving
Traceback (most recent call last):
  File "/usr/lib/python3.9/sock

## Data Pre-processing - Article 

In [None]:
df = spark \
    .read \
    .format("csv") \
    .option("header", True) \
    .load("/content/CNN/cnn_dailymail")

df.printSchema()

root
 |-- id: string (nullable = true)
 |-- article: string (nullable = true)
 |-- highlights: string (nullable = true)



In [None]:
df = df.dropna()

In [None]:
df.count()

382518

In [None]:
df = df.withColumnRenamed("article", "text")

In [None]:
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

# Split on non-word characters, then drop English stop words
regexTokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W")
stopremove = StopWordsRemover(inputCol='tokens',outputCol='cleaned')

In [None]:
from pyspark.ml import Pipeline
data_prep_pipe = Pipeline(stages=[regexTokenizer, stopremove ])
cleaner = data_prep_pipe.fit(df)
df = cleaner.transform(df)

In [None]:
df.show(5)

+--------------------+--------------------+--------------------+--------------------+--------------------+
|                  id|                text|          highlights|              tokens|             cleaned|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|0001d1afc246a7964...|By . Associated P...|Bishop John Folda...|[by, associated, ...|[associated, pres...|
|0002095e55fcbd3a2...|"(CNN) -- Ralph M...|"" of using his r...|[cnn, ralph, mata...|[cnn, ralph, mata...|
|          Ralph Mata| an internal affa...| allegedly helped...|[an, internal, af...|[internal, affair...|
|00027e965c8264c35...|A drunk driver wh...|Craig Eccleston-T...|[a, drunk, driver...|[drunk, driver, k...|
|0002c17436637c4fe...|(CNN) -- With a b...|Nina dos Santos s...|[cnn, with, a, br...|[cnn, breezy, swe...|
+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 5 rows



In [None]:
df = df['text','cleaned']

In [None]:
df.count()

382518

In [None]:
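from pyspark.sql.functions import monotonically_increasing_id

# IDs are unique and increasing but consecutive only within a partition,
# so the `id < N` filters below return N rows only while the first partition holds at least N rows.
df = df.withColumn("id", monotonically_increasing_id())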
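The IDs generated here are not a plain row counter. According to the Spark documentation, the current implementation puts the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The sketch below imitates that layout in plain Python; the function name `spark_like_id` and the partition sizes are hypothetical.

```python
def spark_like_id(partition_index, row_in_partition):
    """Imitate the documented layout: partition ID in the upper bits, row number in the lower 33 bits."""
    return (partition_index << 33) + row_in_partition

# Two hypothetical partitions with three rows each
ids = [spark_like_id(p, r) for p in range(2) for r in range(3)]
print(ids)

# A filter such as `id < 5` only matches rows from the first partition
matching = sum(i < 5 for i in ids)
print(matching)
```

The row counts shown in the following cells match the requested sample sizes, which suggests the first partition was large enough in this run.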

In [None]:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import collect_set, array_distinct
from pyspark.ml.fpm import FPGrowth
import timeit



### Model with 10 articles


In [None]:
df1 = df.where(df.id < 10)

In [None]:
df1.count()

10

In [None]:
df1.show()

+--------------------+--------------------+---+
|                text|             cleaned| id|
+--------------------+--------------------+---+
|By . Associated P...|[associated, pres...|  0|
|"(CNN) -- Ralph M...|[cnn, ralph, mata...|  1|
| an internal affa...|[internal, affair...|  2|
|A drunk driver wh...|[drunk, driver, k...|  3|
|(CNN) -- With a b...|[cnn, breezy, swe...|  4|
|Fleetwood are the...|[fleetwood, team,...|  5|
|        Bristol City|     [bristol, city]|  6|
|He's been accused...|[accused, making,...|  7|
|By . Daily Mail R...|[daily, mail, rep...|  8|
|"By . Daily Mail ...|[daily, mail, rep...|  9|
+--------------------+--------------------+---+



In [None]:

baskets = df1.select(array_distinct(df1.cleaned)).collect()
baskets = spark.createDataFrame(baskets)
start = timeit.default_timer()
fpGrowth = FPGrowth(minSupport=0.05, minConfidence=1, itemsCol="array_distinct(cleaned)")
model = fpGrowth.fit(baskets)
model.freqItemsets.show(10)
article_10 = timeit.default_timer() - start

+--------------------+----+
|               items|freq|
+--------------------+----+
|        [atmosphere]|   1|
|[atmosphere, gips...|   1|
|[atmosphere, gips...|   1|
|[atmosphere, gips...|   1|
|[atmosphere, gips...|   1|
|[atmosphere, gips...|   1|
|[atmosphere, gips...|   1|
|[atmosphere, gips...|   1|
|[atmosphere, gips...|   1|
|[atmosphere, gips...|   1|
+--------------------+----+
only showing top 10 rows



In [None]:
article_10

1.6151313370000935

### Model with 100 articles



In [None]:
df2 = df.where(df.id < 100)

In [None]:
df2.count()

100

In [None]:
baskets = df2.select(array_distinct(df2.cleaned)).collect()
baskets = spark.createDataFrame(baskets)
start = timeit.default_timer()
fpGrowth = FPGrowth(minSupport=0.05, minConfidence=1, itemsCol="array_distinct(cleaned)")
model = fpGrowth.fit(baskets)
model.freqItemsets.show(10)
article_100 = timeit.default_timer() - start



+--------------------+----+
|               items|freq|
+--------------------+----+
|              [need]|   5|
|        [need, like]|   5|
|              [good]|  12|
|        [good, next]|   7|
| [good, next, first]|   5|
|[good, next, firs...|   5|
|  [good, next, year]|   6|
|[good, next, year...|   6|
|  [good, next, made]|   5|
|[good, next, made...|   5|
+--------------------+----+
only showing top 10 rows



In [None]:
article_100

0.764393119000033

In [None]:
model.associationRules.show()

+--------------------+----------+----------+------------------+-------+
|          antecedent|consequent|confidence|              lift|support|
+--------------------+----------+----------+------------------+-------+
|[think, people, y...|    [four]|       1.0|  5.88235294117647|   0.05|
|[published, home,...|    [said]|       1.0| 1.923076923076923|   0.05|
|[published, home,...|    [time]|       1.0|2.7777777777777777|   0.05|
|[day, last, two, ...|   [court]|       1.0| 11.11111111111111|   0.05|
|[day, last, two, ...|    [said]|       1.0| 1.923076923076923|   0.05|
|[told, years, two...|     [day]|       1.0| 3.846153846153846|   0.05|
|[told, years, two...|    [time]|       1.0|2.7777777777777777|   0.05|
|[told, years, two...|    [year]|       1.0|               2.5|   0.05|
|[told, years, two...|    [said]|       1.0| 1.923076923076923|   0.05|
|[published, updat...|     [est]|       1.0| 7.692307692307692|   0.06|
|[2013, est, publi...|    [home]|       1.0| 4.166666666666667| 
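The lift column can be read back into basket counts. This is a small worked check, assuming lift is defined as confidence divided by the support of the consequent (the definition Spark documents for its association rules); the helper name is hypothetical and the lift values are copied from the table above.

```python
def implied_consequent_count(confidence, lift, n_baskets):
    """Invert lift = confidence / support(consequent) to recover the consequent's basket count."""
    return round(confidence / lift * n_baskets)

# Lift values copied from the table above; all of these rules have confidence 1.0
lifts = {
    "said": 1.923076923076923,
    "time": 2.7777777777777777,
    "year": 2.5,
    "court": 11.11111111111111,
}

counts = {word: implied_consequent_count(1.0, lift, 100) for word, lift in lifts.items()}
print(counts)
```

Under that assumption, a low lift such as the one for `[said]` simply reflects a word that appears in about half of the 100 sampled articles, so rules pointing to it carry little information.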

### Model with 1000 articles

In [None]:
df3 = df.where(df.id < 1000)
baskets = df3.select(array_distinct(df3.cleaned)).collect()
baskets = spark.createDataFrame(baskets)
start = timeit.default_timer()
fpGrowth = FPGrowth(minSupport=0.05, minConfidence=1, itemsCol="array_distinct(cleaned)")
model = fpGrowth.fit(baskets)
model.freqItemsets.show(10)
article_1000 = timeit.default_timer() - start


+--------------------+----+
|               items|freq|
+--------------------+----+
|              [want]| 127|
|       [want, right]|  56|
| [want, right, said]|  50|
|       [want, first]|  56|
|        [want, know]|  50|
|        [want, told]|  52|
|        [want, back]|  53|
|        [want, year]|  82|
|   [want, year, one]|  67|
|[want, year, one,...|  60|
+--------------------+----+
only showing top 10 rows



In [None]:
df3.count()

1000

In [None]:
article_1000

2.433014306000132

In [None]:
model.associationRules.show()

+--------------------+-----------+----------+------------------+-------+
|          antecedent| consequent|confidence|              lift|support|
+--------------------+-----------+----------+------------------+-------+
|[est, published, ...|  [updated]|       1.0| 5.780346820809249|  0.055|
|[2013, est, updat...|[published]|       1.0|  5.46448087431694|   0.07|
|[home, three, tol...|     [said]|       1.0|1.8083182640144664|   0.05|
|[later, told, old...|     [year]|       1.0| 2.347417840375587|  0.053|
|[2013, est, publi...|  [updated]|       1.0| 5.780346820809249|  0.054|
|[est, published, ...|  [updated]|       1.0| 5.780346820809249|  0.052|
|[right, left, tol...|     [said]|       1.0|1.8083182640144664|  0.054|
|       [work, added]|     [said]|       1.0|1.8083182640144664|  0.052|
|[est, published, ...|  [updated]|       1.0| 5.780346820809249|  0.053|
|   [spokesman, year]|     [said]|       1.0|1.8083182640144664|  0.058|
|[est, left, year,...|  [updated]|       1.0| 5.780

### Calculating the average number of tokens per article

In [None]:
from pyspark.sql.functions import size

# Count tokens per basket with the built-in size() instead of a Python UDF
baskets = baskets.withColumn('count', size(col('array_distinct(cleaned)')))
baskets.describe().show()

+-------+------------------+
|summary|             count|
+-------+------------------+
|  count|              1000|
|   mean|           149.276|
| stddev|128.33184673683647|
|    min|                 1|
|    max|               614|
+-------+------------------+



### Model with 5000 articles

In [None]:
df5 = df.where(df.id < 5000)

baskets = df5.select(array_distinct(df5.cleaned)).collect()
baskets = spark.createDataFrame(baskets)
start = timeit.default_timer()
fpGrowth = FPGrowth(minSupport=0.05, minConfidence=1, itemsCol="array_distinct(cleaned)")
model = fpGrowth.fit(baskets)
model.freqItemsets.show(10)
article_5000 = timeit.default_timer() - start

+--------------------+----+
|               items|freq|
+--------------------+----+
|           [however]| 668|
|     [however, like]| 262|
|      [however, new]| 323|
|[however, new, said]| 279|
|     [however, told]| 313|
|[however, told, s...| 288|
|     [however, year]| 429|
|[however, year, one]| 320|
|[however, year, o...| 279|
|[however, year, s...| 373|
+--------------------+----+
only showing top 10 rows



In [None]:
df5.count()

5000

In [None]:
article_5000

14.432767931999933

In [None]:
model.associationRules.show()

+--------------------+-----------+----------+-----------------+-------+
|          antecedent| consequent|confidence|             lift|support|
+--------------------+-----------+----------+-----------------+-------+
|[est, published, ...|  [updated]|       1.0|5.359056806002144|  0.062|
|[2013, est, updat...|[published]|       1.0| 5.09683995922528| 0.0632|
|[est, published, ...|  [updated]|       1.0|5.359056806002144| 0.0564|
|[est, left, year,...|  [updated]|       1.0|5.359056806002144| 0.0548|
|[est, published, ...|  [updated]|       1.0|5.359056806002144| 0.0572|
|[updated, publish...|      [est]|       1.0|5.773672055427252| 0.0694|
|[10, est, published]|  [updated]|       1.0|5.359056806002144| 0.0568|
|[updated, publish...|      [est]|       1.0|5.773672055427252|  0.062|
|          [est, new]|  [updated]|       1.0|5.359056806002144| 0.0738|
|[est, published, ...|  [updated]|       1.0|5.359056806002144| 0.0608|
|         [est, life]|  [updated]|       1.0|5.359056806002144| 

### Model with 7500 articles

In [None]:
df6 = df.where(df.id < 7500)

baskets = df6.select(array_distinct(df6.cleaned)).collect()
baskets = spark.createDataFrame(baskets)

start = timeit.default_timer()
fpGrowth = FPGrowth(minSupport=0.05, minConfidence=1, itemsCol="array_distinct(cleaned)")
model = fpGrowth.fit(baskets)
model.freqItemsets.show(10)
article_7500 = timeit.default_timer() - start

+--------------------+----+
|               items|freq|
+--------------------+----+
|              [went]| 988|
|        [went, made]| 395|
|         [went, new]| 412|
|      [went, people]| 437|
|[went, people, said]| 392|
|        [went, year]| 686|
|   [went, year, one]| 504|
|[went, year, one,...| 451|
|  [went, year, said]| 601|
|        [went, last]| 573|
+--------------------+----+
only showing top 10 rows



In [None]:
df6.count()

7500

In [None]:
article_7500

19.868800029999875

In [None]:
model.associationRules.show()

+--------------------+-----------+----------+-----------------+--------------------+
|          antecedent| consequent|confidence|             lift|             support|
+--------------------+-----------+----------+-----------------+--------------------+
|[est, published, ...|  [updated]|       1.0|5.289139633286319|              0.0612|
|[2013, est, updat...|[published]|       1.0|5.091649694501019| 0.06506666666666666|
|[est, published, ...|  [updated]|       1.0|5.289139633286319|0.054266666666666664|
|[est, published, ...|  [updated]|       1.0|5.289139633286319|0.056933333333333336|
|[10, est, published]|  [updated]|       1.0|5.289139633286319| 0.05893333333333333|
|[est, published, ...|  [updated]|       1.0|5.289139633286319| 0.05973333333333333|
|[2013, est, updat...|[published]|       1.0|5.091649694501019| 0.05506666666666667|
|   [2013, est, time]|[published]|       1.0|5.091649694501019|              0.0552|
|[mr, est, publish...|  [updated]|       1.0|5.289139633286319|0.

### Model with 10000 articles

In [None]:
df7 = df.where(df.id < 10000)

baskets = df7.select(array_distinct(df7.cleaned)).collect()
baskets = spark.createDataFrame(baskets)
start = timeit.default_timer()
fpGrowth = FPGrowth(minSupport=0.05, minConfidence=1, itemsCol="array_distinct(cleaned)")
model = fpGrowth.fit(baskets)
model.freqItemsets.show(10)
article_10000 = timeit.default_timer() - start

+--------------------+----+
|               items|freq|
+--------------------+----+
|              [says]|1308|
|        [says, made]| 506|
|         [says, new]| 644|
|   [says, new, said]| 549|
|      [says, people]| 647|
|[says, people, said]| 553|
|        [says, year]| 847|
|   [says, year, one]| 626|
|[says, year, one,...| 545|
|  [says, year, said]| 729|
+--------------------+----+
only showing top 10 rows



In [None]:
df7.count()

10000

In [None]:
article_10000

28.613862445999985

In [None]:
model.associationRules.show()

+--------------------+----------+----------+----------------+-------+
|          antecedent|consequent|confidence|            lift|support|
+--------------------+----------+----------+----------------+-------+
|[est, published, ...| [updated]|       1.0|5.31632110579479| 0.0599|
|[est, published, ...| [updated]|       1.0|5.31632110579479| 0.0526|
|[est, published, ...| [updated]|       1.0|5.31632110579479| 0.0546|
|[10, est, published]| [updated]|       1.0|5.31632110579479| 0.0582|
|[est, published, ...| [updated]|       1.0|5.31632110579479|  0.058|
|[mr, est, publish...| [updated]|       1.0|5.31632110579479| 0.0539|
|[2013, updated, p...|     [est]|       1.0|5.74712643678161| 0.0544|
|   [2013, est, last]| [updated]|       1.0|5.31632110579479| 0.0521|
|[family, est, pub...| [updated]|       1.0|5.31632110579479| 0.0509|
|[2013, updated, p...|     [est]|       1.0|5.74712643678161| 0.0543|
|[est, published, ...| [updated]|       1.0|5.31632110579479| 0.0531|
|[est, published, ..

### Model with 50000 articles

In [None]:
df8 = df.where(df.id < 50000)

baskets = df8.select(array_distinct(df8.cleaned)).collect()
baskets = spark.createDataFrame(baskets)

start = timeit.default_timer()
fpGrowth = FPGrowth(minSupport=0.05, minConfidence=1, itemsCol="array_distinct(cleaned)")
model = fpGrowth.fit(baskets)
model.freqItemsets.show(10)
article_50000 = timeit.default_timer() - start

Py4JJavaError: ignored

# CNN Article Highlights Market Basket Analysis in Pandas

## Data Preprocessing

In [None]:
df = pd.read_csv('/content/CNN/cnn_dailymail/test.csv')

In [None]:
df1 = pd.read_csv('/content/CNN/cnn_dailymail/train.csv')
df2 = pd.read_csv('/content/CNN/cnn_dailymail/validation.csv')

In [None]:
len(df2)+len(df)+len(df1)

311971

In [None]:
dataframe = pd.concat([df,df1,df2])

In [None]:
dataframe = dataframe.dropna(axis=0, how="any")

In [None]:
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 311971 entries, 0 to 13367
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   id          311971 non-null  object
 1   article     311971 non-null  object
 2   highlights  311971 non-null  object
dtypes: object(3)
memory usage: 9.5+ MB


In [None]:
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

processed=[]

# Build the stemmer and the stop word set once, outside the loop
ps = PorterStemmer()
all_stopwords = set(stopwords.words('english'))
all_stopwords.discard('not')  # keep 'not', as it carries meaning

for article in dataframe['highlights']:
    article = re.sub(r'http://\S+|https://\S+', "", article)  # remove links
    article = re.sub(r"@[A-Za-z0-9]+", "", article)  # remove @mentions
    article = re.sub(r"www\.\S+", "", article)  # remove bare www links

    article = re.sub('[^a-zA-Z]', ' ', article)  # replace anything that is not a letter (a-z, A-Z) with a space

    article = article.lower()  # lowercase all the words
    article = article.split()  # split the highlight into words

    # Drop stop words, then stem the remaining words to their roots with the Porter stemmer
    article = [ps.stem(word) for word in article if word not in all_stopwords]
    article = ' '.join(article)
    processed.append(article)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
dataframe['processed'] = processed

In [None]:
dataframe.head()

Unnamed: 0,id,article,highlights,processed
0,92c514c913c0bdfe25341af9fd72b29db544099b,Ever noticed how plane seats appear to be gett...,Experts question if packed out planes are put...,expert question pack plane put passeng risk u ...
1,2003841c7dc0e7c5b1a248f9cd536d727f27a45a,A drunk teenage boy had to be rescued by secur...,Drunk teenage boy climbed into lion enclosure ...,drunk teenag boy climb lion enclosur zoo west ...
2,91b7d2311527f5c2b63a65ca98d21d9c92485149,Dougie Freedman is on the verge of agreeing a ...,Nottingham Forest are close to extending Dougi...,nottingham forest close extend dougi freedman ...
3,caabf9cbdf96eb1410295a673e953d304391bfbb,Liverpool target Neto is also wanted by PSG an...,Fiorentina goalkeeper Neto has been linked wit...,fiorentina goalkeep neto link liverpool arsen ...
4,3da746a7d9afcaa659088c8366ef6347fe6b53ea,Bruce Jenner will break his silence in a two-h...,"Tell-all interview with the reality TV star, 6...",tell interview realiti tv star air friday apri...


Creating tokens that can be consumed by the Apriori algorithm

In [None]:
tokens = dataframe['processed']

In [None]:
transactions = tokens.str.split()

In [None]:
transactions = [tuple(row) for row in transactions.values.tolist()]

## Creating Apriori Models

In [None]:
!pip install efficient_apriori 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting efficient_apriori
  Downloading efficient_apriori-2.0.3-py3-none-any.whl (14 kB)
Installing collected packages: efficient_apriori
Successfully installed efficient_apriori-2.0.3


In [None]:
from efficient_apriori import apriori
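For orientation, the level-wise idea behind Apriori can be sketched in a few lines of plain Python: count single items, keep the frequent ones, then repeatedly join and prune larger candidates. This is a simplified illustration on made-up transactions, not the `efficient_apriori` implementation; the function name `frequent_itemsets` is hypothetical.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori sketch: return {itemset (frozenset): support}."""
    n = len(transactions)
    baskets = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for basket in baskets:
        for item in basket:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s for s, c in counts.items() if c / n >= min_support}
    result = {s: counts[s] / n for s in current}

    k = 2
    while current:
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        counts = {c: sum(c <= basket for basket in baskets) for c in candidates}
        current = {c for c, cnt in counts.items() if cnt / n >= min_support}
        result.update({c: counts[c] / n for c in current})
        k += 1
    return result

# Hypothetical stemmed transactions, in the same tuple format as above
demo_transactions = [
    ("premier", "leagu", "win"),
    ("premier", "leagu", "unit"),
    ("premier", "leagu"),
    ("cup", "final"),
]
demo_itemsets = frequent_itemsets(demo_transactions, min_support=0.5)
print(demo_itemsets)
```

Each level rescans all transactions, which is one reason the run time grows quickly when the minimum support is lowered and many more candidates survive.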

In [None]:
start = timeit.default_timer()
itemsets, rules = apriori(transactions, min_support=0.05, min_confidence=1)
print(rules)
highlights_pandas_05 = timeit.default_timer() - start

[]


In [None]:
highlights_pandas_05 

0.29052923600011127

In [None]:
start = timeit.default_timer()
itemsets, rules = apriori(transactions, min_support=0.005, min_confidence=1)
print(rules)
highlights_pandas_005 = timeit.default_timer() - start

[{wenger} -> {arsen}, {gaal} -> {van}, {trafford} -> {old}, {raheem} -> {sterl}, {arsen, premier} -> {leagu}, {barcelona, madrid} -> {real}, {chelsea, premier} -> {leagu}, {citi, premier} -> {leagu}, {latest, news} -> {click}, {fa, semi} -> {cup}, {fight, floyd} -> {mayweath}, {floyd, manni} -> {mayweath}, {floyd, pacquiao} -> {mayweath}, {loui, manchest} -> {gaal}, {loui, unit} -> {gaal}, {loui, van} -> {gaal}, {gaal, loui} -> {van}, {gaal, manchest} -> {unit}, {gaal, manchest} -> {van}, {gaal, unit} -> {van}, {game, premier} -> {leagu}, {player, premier} -> {leagu}, {point, premier} -> {leagu}, {premier, saturday} -> {leagu}, {premier, side} -> {leagu}, {premier, top} -> {leagu}, {premier, unit} -> {leagu}, {premier, win} -> {leagu}, {liverpool, raheem} -> {sterl}, {loui, manchest} -> {unit}, {loui, manchest} -> {van}, {loui, unit} -> {van}, {trafford, unit} -> {old}, {citi, manchest, premier} -> {leagu}, {latest, leagu, news} -> {click}, {click, latest, leagu} -> {news}, {fa, final,

In [None]:
highlights_pandas_005

119.01999203700007

In [None]:
start = timeit.default_timer()
itemsets, rules = apriori(transactions, min_support=0.0005, min_confidence=1)
print(rules)
highlights_pandas_0005 = timeit.default_timer() - start

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

