## Clean flu_data_topic: drop rows with NaN values

In [99]:
import pandas as pd

# Load the raw topic-level data from the CSV
data = pd.read_csv('data/flu_data_topic_2.csv')

# Convert the literal string 'NaN' to a real missing value.
# With default settings read_csv already parses 'NaN' as missing, so this is only a safeguard.
data.replace('NaN', pd.NA, inplace=True)

# Drop rows where either 'Topic' or 'Content' is NaN
cleaned_data = data.dropna(subset=['Topic', 'Content'], how='any')

# Display the cleaned DataFrame
print(cleaned_data)

# Save the cleaned data for the merge step
cleaned_data.to_csv('data/flu_data_topic_cleaned.csv', index=False)

          Year                                              Topic  \
0    2006-2007   When and where did the 2006-07 flu season start?   
1    2006-2007           How severe was the 2006-2007 flu season?   
2    2006-2007      What determines the severity of a flu season?   
3    2006-2007  Where did the most flu activity occur in the U...   
4    2006-2007            When did the 2006-2007 flu season peak?   
..         ...                                                ...   
142  2023-2024  Updates to the Advisory Committee on Immunizat...   
143  2023-2024  Updates to U.S. Flu Surveillance Methods for t...   
144  2023-2024                B/Yamagata and Flu Vaccines Summary   
145  2023-2024  Coinfection: Getting More than One Respiratory...   
146  2023-2024  Flu, RSV, and COVID-19 Coinfection Data: 2023-...   

                                               Content  
0    The first report of regional flu activity came...  
1    The 2006-07 flu season was generally mild comp...  
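
To make the effect of the two cleaning calls easier to see, here is a minimal sketch on a hand-made frame. The rows in `toy` are hypothetical and only imitate the `Year`/`Topic`/`Content` layout shown above; they are not taken from the real CSV.

```python
import pandas as pd

# Hypothetical rows that imitate the layout of the topic-level data
toy = pd.DataFrame({
    "Year": ["2006-2007", "2006-2007", "2007-2008"],
    "Topic": ["When did the season start?", "NaN", "How severe was it?"],
    "Content": ["Activity began in November.", "Orphan text.", None],
})

# Turn the literal string 'NaN' into a real missing value
toy = toy.replace("NaN", pd.NA)

# Keep only rows where both 'Topic' and 'Content' are present
toy_cleaned = toy.dropna(subset=["Topic", "Content"], how="any")

print(toy_cleaned)
```

Only the first row survives: the second has a missing `Topic` after the replacement, and the third has a missing `Content`.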


## Merge data

In [100]:
import pandas as pd

# Load the CSV file
data = pd.read_csv('data/flu_data_topic_cleaned.csv')

# Convert NaN to empty strings and ensure all data in 'Topic' and 'Content' are strings
data['Topic'] = data['Topic'].fillna('').astype(str)
data['Content'] = data['Content'].fillna('').astype(str)

# Filter out rows where 'Content' is empty; .copy() avoids a SettingWithCopyWarning when adding a column below
data = data[data['Content'].str.strip() != ''].copy()

# Combine 'Topic' and 'Content' into a single "Topic: Content" string in a new column.
# Rows with empty 'Content' were removed above, so the fallback to 'Topic' alone is only defensive.
data['MergedContent'] = data.apply(lambda row: f"{row['Topic']}: {row['Content']}" if row['Content'].strip() else row['Topic'], axis=1)

# Group by 'Year' and merge all content into a single cell per year
grouped_data = data.groupby('Year')['MergedContent'].apply('\n\n'.join).reset_index()

# Rename columns to match the desired output
grouped_data.columns = ['Year', 'Content']

# Save the transformed data back to a CSV file
grouped_data.to_csv('data/flu_data.csv', index=False)
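
The group-and-join step can be checked on a small example. This is a sketch under the assumption that every row already has non-empty `Topic` and `Content`; the `toy` values are placeholders, not real data.

```python
import pandas as pd

# Placeholder rows: two topics for one season, one topic for another
toy = pd.DataFrame({
    "Year": ["2006-2007", "2006-2007", "2007-2008"],
    "Topic": ["Q1?", "Q2?", "Q3?"],
    "Content": ["A1.", "A2.", "A3."],
})

# Build the "Topic: Content" string for each row
toy["MergedContent"] = toy["Topic"] + ": " + toy["Content"]

# One row per season, with the merged strings separated by a blank line
toy_grouped = (
    toy.groupby("Year")["MergedContent"]
    .apply("\n\n".join)
    .reset_index()
    .rename(columns={"MergedContent": "Content"})
)

print(toy_grouped.loc[0, "Content"])
```

The first season's cell then holds both question/answer pairs, separated by an empty line, which matches the shape of `flu_data.csv` used in the next section.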


## Text Cleaning

In [1]:
!pip install pandas
!pip install spacy



In [101]:
import pandas as pd
import re
import string
import spacy
import unicodedata

### Import file

In [102]:
df = pd.read_csv('data/flu_data.csv')

### Stopwords removal

In [103]:
df

Unnamed: 0,Year,Content
0,2006-2007,When and where did the 2006-07 flu season star...
1,2007-2008,How severe was the 2007-2008 flu season?: A gr...
2,2008-2009,What was the 2009-2010 flu season like?: Flu s...
3,2010-2011,What was the 2010-2011 flu season like?: In co...
4,2011-2012,What was the 2011-2012 flu season like?: In co...
5,2012-2013,What was the 2012-2013 flu season like?: In co...
6,2013-2014,When did flu activity peak?: The timing of flu...
7,2014-2015,What was the 2014-2015 flu season like?: Compa...
8,2015-2016,What was the 2015-2016 flu season like?: Flu s...
9,2016-2017,Information for 2016-2017: Getting an annual f...


In [104]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download NLTK resources (if not already downloaded)
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/tongfah/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/tongfah/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [105]:
text = df['Content'].values[0]

## Text cleaning

### Clean punctuation

In [106]:
import re

def clean_punctuation(text):
    punctuation_pattern = re.compile(r'[^\w\s]|_')
    cleaned_text = re.sub(punctuation_pattern, '', text)

    return cleaned_text
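
Because the pattern deletes punctuation outright, hyphenated terms are fused: `2006-07` becomes `200607` and `mid-February` becomes `midFebruary`. The sketch below reproduces that behaviour and contrasts it with a hypothetical variant, `punct_to_space`, that substitutes a space instead. The variant is not part of this notebook; it is only shown as one possible alternative.

```python
import re

PUNCT = re.compile(r"[^\w\s]|_")

def strip_punct(text):
    # Same behaviour as clean_punctuation: punctuation is deleted
    return PUNCT.sub("", text)

def punct_to_space(text):
    # Hypothetical variant: punctuation becomes a space, then all
    # whitespace runs (including newlines) collapse to a single space
    return re.sub(r"\s+", " ", PUNCT.sub(" ", text)).strip()

sample = "Flu-like activity peaked in mid-February (2006-07)."
merged = strip_punct(sample)
spaced = punct_to_space(sample)

print(merged)  # Flulike activity peaked in midFebruary 200607
print(spaced)  # Flu like activity peaked in mid February 2006 07
```

Which form is preferable depends on the downstream task: fused tokens such as `200607` are later removed as digits, whereas the spaced form keeps `mid` and `february` as separate words.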

In [107]:
cleaned_text = clean_punctuation(text)
print("Text:", text)
print("Clean text:", cleaned_text)

Text: When and where did the 2006-07 flu season start?: The first report of regional flu activity came from the southeastern United States during the first week of November. Regional flu activity is defined as increased flu-like activity or flu outbreaks in at least two (but fewer than half) of the regions in a state with recent laboratory evidence of flu in those regions.

How severe was the 2006-2007 flu season?: The 2006-07 flu season was generally mild compared to recent flu seasons. For example, the proportion of all deaths associated with influenza illness was lower this season than the previous three flu seasons. Hospitalization rates among children were also lower than the previous three flu seasons. However, more pediatric deaths related to influenza were reported during the 2006-07 season than the previous two seasons. Nationally, low levels of flu activity were reported during October through mid-December. Flu activity increased during late December, peaked in mid-February, 

### Lower case

In [108]:
lower_text = cleaned_text.lower()
print(cleaned_text)
print(lower_text)

When and where did the 200607 flu season start The first report of regional flu activity came from the southeastern United States during the first week of November Regional flu activity is defined as increased flulike activity or flu outbreaks in at least two but fewer than half of the regions in a state with recent laboratory evidence of flu in those regions

How severe was the 20062007 flu season The 200607 flu season was generally mild compared to recent flu seasons For example the proportion of all deaths associated with influenza illness was lower this season than the previous three flu seasons Hospitalization rates among children were also lower than the previous three flu seasons However more pediatric deaths related to influenza were reported during the 200607 season than the previous two seasons Nationally low levels of flu activity were reported during October through midDecember Flu activity increased during late December peaked in midFebruary and decreased through the end o

### Character Normalization

In [109]:
import unicodedata

def normalize_characters(text):
    normalized_text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')
    return normalized_text
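
A short check of what NFKD normalization followed by an ASCII encode does to typical non-ASCII characters. `to_ascii` is an illustrative name for the same expression used in `normalize_characters`, with the final decode written as `'ascii'` for symmetry.

```python
import unicodedata

def to_ascii(text):
    # Decompose characters (NFKD), then drop anything that is not ASCII
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

accented = to_ascii("café naïve")    # accents are split off and dropped
ligature = to_ascii("in\ufb02uenza")  # the 'fl' ligature decomposes to 'fl'
curly = to_ascii("CDC\u2019s")        # the curly apostrophe has no ASCII form

print(accented, ligature, curly)  # cafe naive influenza CDCs
```

Characters with a compatibility decomposition keep their base letters, while characters without one, such as the curly apostrophe in `CDC’s`, are silently removed.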

In [110]:
normalized_text = normalize_characters(lower_text)
print(lower_text)
print(normalized_text)

when and where did the 200607 flu season start the first report of regional flu activity came from the southeastern united states during the first week of november regional flu activity is defined as increased flulike activity or flu outbreaks in at least two but fewer than half of the regions in a state with recent laboratory evidence of flu in those regions

how severe was the 20062007 flu season the 200607 flu season was generally mild compared to recent flu seasons for example the proportion of all deaths associated with influenza illness was lower this season than the previous three flu seasons hospitalization rates among children were also lower than the previous three flu seasons however more pediatric deaths related to influenza were reported during the 200607 season than the previous two seasons nationally low levels of flu activity were reported during october through middecember flu activity increased during late december peaked in midfebruary and decreased through the end o

## Text preprocessing

#### Lemmatization

In [111]:
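import spacy

# Load the model once; reloading it on every call is slow.
# Requires the model to be installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def lemmatization(text):
    # Replace each token with its lemma and rejoin with single spaces
    doc = nlp(text)
    lemmatized_words = [token.lemma_ for token in doc]
    lemmatized_text = ' '.join(lemmatized_words)
    return lemmatized_text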


In [112]:
lemmatized_text = lemmatization(normalized_text)

print(cleaned_text)
print(lemmatized_text)

When and where did the 200607 flu season start The first report of regional flu activity came from the southeastern United States during the first week of November Regional flu activity is defined as increased flulike activity or flu outbreaks in at least two but fewer than half of the regions in a state with recent laboratory evidence of flu in those regions

How severe was the 20062007 flu season The 200607 flu season was generally mild compared to recent flu seasons For example the proportion of all deaths associated with influenza illness was lower this season than the previous three flu seasons Hospitalization rates among children were also lower than the previous three flu seasons However more pediatric deaths related to influenza were reported during the 200607 season than the previous two seasons Nationally low levels of flu activity were reported during October through midDecember Flu activity increased during late December peaked in midFebruary and decreased through the end o

### Text cleaning after pre-processing

#### Tokenization

In [113]:
import string

def tokenization(text):
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return tokens
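
As a quick illustration of the translate-then-split approach, the sketch below runs it on a short string containing extra spaces and blank lines, similar to the lemmatized text. `simple_tokens` is an illustrative name, and the input string is made up.

```python
import string

def simple_tokens(text):
    # Delete ASCII punctuation, then split on any run of whitespace
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table).split()

tokens = simple_tokens("flu season  start \n\n how severe be the flu-season?")
print(tokens)
```

`str.split()` with no argument treats spaces, tabs, and newlines alike, so the blank lines between topics leave no empty tokens. Note that `string.punctuation` covers ASCII punctuation only.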

In [114]:
text_tokens = tokenization(lemmatized_text)
print(lemmatized_text)
print(text_tokens)

when and where do the 200607 flu season start the first report of regional flu activity come from the southeastern united states during the first week of november regional flu activity be define as increase flulike activity or flu outbreak in at least two but few than half of the region in a state with recent laboratory evidence of flu in those region 

 how severe be the 20062007 flu season the 200607 flu season be generally mild compare to recent flu season for example the proportion of all death associate with influenza illness be low this season than the previous three flu season hospitalization rate among child be also low than the previous three flu season however more pediatric death relate to influenza be report during the 200607 season than the previous two season nationally low level of flu activity be report during october through middecember flu activity increase during late december peak in midfebruary and decrease through the end of the flu season on may 19 

 what determ

#### Stopwords removal

In [115]:
def stopwords_removal(text: str):
    words = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return filtered_words

In [116]:
stopword_text = stopwords_removal(lemmatized_text)   
print(lower_text)
print(stopword_text)

when and where did the 200607 flu season start the first report of regional flu activity came from the southeastern united states during the first week of november regional flu activity is defined as increased flulike activity or flu outbreaks in at least two but fewer than half of the regions in a state with recent laboratory evidence of flu in those regions

how severe was the 20062007 flu season the 200607 flu season was generally mild compared to recent flu seasons for example the proportion of all deaths associated with influenza illness was lower this season than the previous three flu seasons hospitalization rates among children were also lower than the previous three flu seasons however more pediatric deaths related to influenza were reported during the 200607 season than the previous two seasons nationally low levels of flu activity were reported during october through middecember flu activity increased during late december peaked in midfebruary and decreased through the end o

In [131]:
def remove_digits(list_token_text):
    filtered_tokens = [token for token in list_token_text if not token.isdigit()]
    return filtered_tokens
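
`str.isdigit()` is only true when every character in the token is a digit, so mixed tokens survive the filter. A small sketch with made-up tokens shows the difference, alongside a hypothetical stricter filter, `drop_any_digit`, that is not used in this notebook.

```python
def drop_numeric(tokens):
    # Same rule as remove_digits: drop tokens made up entirely of digits
    return [t for t in tokens if not t.isdigit()]

def drop_any_digit(tokens):
    # Hypothetical stricter rule: drop tokens containing any digit
    return [t for t in tokens if not any(ch.isdigit() for ch in t)]

sample_tokens = ["200607", "flu", "h1n1", "19th", "3.5"]
kept = drop_numeric(sample_tokens)
kept_strict = drop_any_digit(sample_tokens)

print(kept)         # ['flu', 'h1n1', '19th', '3.5']
print(kept_strict)  # ['flu']
```

For flu text the looser rule is probably the safer default, since subtype names such as `h1n1` carry meaning.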

In [133]:
filtered_tokens = remove_digits(stopword_text)

print(stopword_text)
print(filtered_tokens)

['200607', 'flu', 'season', 'start', 'first', 'report', 'regional', 'flu', 'activity', 'come', 'southeastern', 'united', 'states', 'first', 'week', 'november', 'regional', 'flu', 'activity', 'define', 'increase', 'flulike', 'activity', 'flu', 'outbreak', 'least', 'two', 'half', 'region', 'state', 'recent', 'laboratory', 'evidence', 'flu', 'region', 'severe', '20062007', 'flu', 'season', '200607', 'flu', 'season', 'generally', 'mild', 'compare', 'recent', 'flu', 'season', 'example', 'proportion', 'death', 'associate', 'influenza', 'illness', 'low', 'season', 'previous', 'three', 'flu', 'season', 'hospitalization', 'rate', 'among', 'child', 'also', 'low', 'previous', 'three', 'flu', 'season', 'however', 'pediatric', 'death', 'relate', 'influenza', 'report', '200607', 'season', 'previous', 'two', 'season', 'nationally', 'low', 'level', 'flu', 'activity', 'report', 'october', 'middecember', 'flu', 'activity', 'increase', 'late', 'december', 'peak', 'midfebruary', 'decrease', 'end', 'flu', 

In [134]:
def preprocess_text(text: str):
    # Clean, normalize, and lemmatize the raw text
    cleaned_text = clean_punctuation(text)
    lower_text = cleaned_text.lower()
    normalized_text = normalize_characters(lower_text)
    lemmatized_text = lemmatization(normalized_text)
    # stopwords_removal tokenizes internally, so a separate tokenization() call is not needed
    stopword_text = stopwords_removal(lemmatized_text)
    filtered_tokens = remove_digits(stopword_text)
    return filtered_tokens

In [135]:
df['Token'] = df['Content'].apply(preprocess_text)

df

Unnamed: 0,Year,Content,Token
0,2006-2007,When and where did the 2006-07 flu season star...,"[flu, season, start, first, report, regional, ..."
1,2007-2008,How severe was the 2007-2008 flu season?: A gr...,"[severe, flu, season, great, proportion, death..."
2,2008-2009,What was the 2009-2010 flu season like?: Flu s...,"[flu, season, like, flu, season, unpredictable..."
3,2010-2011,What was the 2010-2011 flu season like?: In co...,"[flu, season, like, comparison, last, three, s..."
4,2011-2012,What was the 2011-2012 flu season like?: In co...,"[flu, season, like, comparison, season, season..."
5,2012-2013,What was the 2012-2013 flu season like?: In co...,"[flu, season, like, comparison, recent, season..."
6,2013-2014,When did flu activity peak?: The timing of flu...,"[flu, activity, peak, timing, flu, unpredictab..."
7,2014-2015,What was the 2014-2015 flu season like?: Compa...,"[flu, season, like, compare, previous, five, i..."
8,2015-2016,What was the 2015-2016 flu season like?: Flu s...,"[flu, season, like, flu, season, vary, timing,..."
9,2016-2017,Information for 2016-2017: Getting an annual f...,"[information, get, annual, flu, vaccine, first..."


In [136]:
df.to_csv('data/flu_data_token.csv', index=False)

In [2]:
import pandas as pd

drop_row_df = pd.read_csv('data/flu_data_summary_abstractive.csv')
drop_row_df

Unnamed: 0,Year,Topic,Content,LexRank,TextRank,Pegasus,Bart
0,2006-2007,When and where did the 2006-07 flu season start?,The first report of regional flu activity came...,Regional flu activity is defined as increased ...,,Regional flu activity is defined as increased ...,The first report of regional flu activity came...
1,2006-2007,How severe was the 2006-2007 flu season?,The 2006-07 flu season was generally mild comp...,"Flu activity increased during late December, p...","For example, the proportion of all deaths asso...","For example, the proportion of all deaths asso...",The 2006-07 flu season was generally mild comp...
2,2006-2007,What determines the severity of a flu season?,"The overall health impact (e.g., infections, h...",The severity of a flu season can be judged acc...,,and within each state;nThe proportion of influ...,"The overall health impact (e.g., infections, h..."
3,2006-2007,Where did the most flu activity occur in the U...,Influenza viruses were identified in all state...,"From October 1, 2006 to May 19, 2007, widespre...",,"From October 1, 2006 to May 19, 2007, widespre...","From October 1, 2006 to May 19, 2007, widespre..."
4,2006-2007,When did the 2006-2007 flu season peak?,"During the 2006-2007 season, flu activity in t...",Although the timing of peak activity varies fr...,,"During the past 31 years, flu activity in the ...","During the past 31 years, flu activity in the ..."
...,...,...,...,...,...,...,...
136,2023-2024,Updates to the Advisory Committee on Immunizat...,A couple of things are different for the 2023-...,A couple of things are different for the 2023-...,,A couple of things are different for the,The flu season of 2023-
137,2023-2024,Updates to U.S. Flu Surveillance Methods for t...,"Starting with the 2023-2024 influenza season, ...",Flu vaccination is often available at no or lo...,Flu vaccination is often available at no or lo...,Although monitoring influenza-only coded death...,"Starting with the 2023-2024 influenza season, ..."
138,2023-2024,B/Yamagata and Flu Vaccines Summary,Quadrivalent flu vaccines protect against four...,CDC is not involved in regulatory decision-mak...,Quadrivalent flu vaccines protect against four...,Quadrivalent flu vaccines protect against four...,All current flu vaccines in the United States ...
139,2023-2024,Coinfection: Getting More than One Respiratory...,It is possible to get sick with more than one ...,It is also possible to be sick with multiple f...,,It is also possible to be sick with multiple f...,It is possible to get sick with more than one ...


In [6]:
df_clean = drop_row_df.dropna()
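
Called without arguments, `dropna()` removes a row if any column is missing, which is why a single empty summary column is enough to drop a row. The sketch below uses a made-up frame with the same kind of sparsely filled column to compare that with a `subset`-restricted drop and to count missing values per column first.

```python
import pandas as pd

# Made-up rows: 'TextRank' is missing for two of the three
toy = pd.DataFrame({
    "Topic": ["Q1", "Q2", "Q3"],
    "Content": ["A1", "A2", "A3"],
    "LexRank": ["s1", "s2", "s3"],
    "TextRank": [None, "t2", None],
})

# How many values are missing in each column
missing_per_column = toy.isna().sum()

# Drop a row if ANY column is missing (the behaviour of plain dropna())
dropped_any = toy.dropna()

# Drop a row only if one of the listed columns is missing
dropped_subset = toy.dropna(subset=["Content", "LexRank"])

print(missing_per_column)
print(len(dropped_any), len(dropped_subset))  # 1 3
```

Checking `isna().sum()` before dropping makes it clear which column is responsible for most of the removed rows.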

In [7]:
print(df_clean)

          Year                                              Topic  \
1    2006-2007           How severe was the 2006-2007 flu season?   
5    2006-2007  How many people died from flu during the 2006-...   
7    2006-2007  Was there a good match between the influenza s...   
9    2006-2007  What can be done to protect children from flu-...   
16   2007-2008           What flu viruses circulated this season?   
..         ...                                                ...   
134  2021-2022  What vaccine uptake estimates did CDC provide ...   
135  2021-2022  Were there any updates in the methods for flu ...   
137  2023-2024  Updates to U.S. Flu Surveillance Methods for t...   
138  2023-2024                B/Yamagata and Flu Vaccines Summary   
140  2023-2024  Flu, RSV, and COVID-19 Coinfection Data: 2023-...   

                                               Content  \
1    The 2006-07 flu season was generally mild comp...   
5    Exact numbers of how many people died from flu... 

In [8]:
df_clean.to_csv('data/flu_data_summary_qa.csv', index=False)

### Check length

In [10]:
import pandas as pd

df_qa = pd.read_csv('data/flu_data_summary_qa.csv')
df_qa

Unnamed: 0,Year,Topic,Content,LexRank,TextRank,Pegasus,Bart
0,2006-2007,How severe was the 2006-2007 flu season?,The 2006-07 flu season was generally mild comp...,"Flu activity increased during late December, p...","For example, the proportion of all deaths asso...","For example, the proportion of all deaths asso...",The 2006-07 flu season was generally mild comp...
1,2006-2007,How many people died from flu during the 2006-...,Exact numbers of how many people died from flu...,Estimates of flu-associated deaths are made by...,Estimates of flu-associated deaths are made by...,This system collects information each week on ...,Flu-associated deaths are only a nationally no...
2,2006-2007,Was there a good match between the influenza s...,The influenza A (H1) component of the 2006-07 ...,"Overall for the 2006-07 season, 24 percent of ...","In the early months of the season, the majorit...",Fifty percent of the influenza B viruses chara...,The influenza A (H1) component of the 2006-07 ...
3,2006-2007,What can be done to protect children from flu-...,Vaccination remains the best method for preven...,Children with asthma or other conditions shoul...,Household contacts and caregivers of these chi...,Household contacts and caregivers of these chi...,Vaccination remains the best method for preven...
4,2007-2008,What flu viruses circulated this season?,"In the United States, influenza A (H1N1), A (H...",Flu A viruses are subtyped in public health la...,Influenza A viruses accounted for 71% of the s...,Influenza A viruses accounted for 71% of the s...,Influenza A viruses accounted for 71% of the ...
...,...,...,...,...,...,...,...
65,2021-2022,What vaccine uptake estimates did CDC provide ...,CDC’s Weekly Flu Vaccination Dashboard provide...,Additional information about NIS-Flu methods a...,CDC’s Weekly Flu Vaccination Dashboard provide...,The dashboard included information on the numb...,CDC’s Weekly Flu Vaccination Dashboard provide...
66,2021-2022,Were there any updates in the methods for flu ...,"During the 2021-2022 flu season, there were a ...",More information on flu surveillance methodolo...,Hospitals in all 50 states and U.S. territorie...,CDC also added a surveillance system that trac...,"CDC added another surveillance system, the HHS..."
67,2023-2024,Updates to U.S. Flu Surveillance Methods for t...,"Starting with the 2023-2024 influenza season, ...",Flu vaccination is often available at no or lo...,Flu vaccination is often available at no or lo...,Although monitoring influenza-only coded death...,"Starting with the 2023-2024 influenza season, ..."
68,2023-2024,B/Yamagata and Flu Vaccines Summary,Quadrivalent flu vaccines protect against four...,CDC is not involved in regulatory decision-mak...,Quadrivalent flu vaccines protect against four...,Quadrivalent flu vaccines protect against four...,All current flu vaccines in the United States ...


In [13]:
import pandas as pd

# Reload the QA dataset saved above
df_qa = pd.read_csv('data/flu_data_summary_qa.csv')

# Calculate the length of the text in the 'Content' column for each row
df_qa['Content_Length'] = df_qa['Content'].apply(len)

# Display the DataFrame with the new 'Content_Length' column
df_qa

Unnamed: 0,Year,Topic,Content,LexRank,TextRank,Pegasus,Bart,Content_Length
0,2006-2007,How severe was the 2006-2007 flu season?,The 2006-07 flu season was generally mild comp...,"Flu activity increased during late December, p...","For example, the proportion of all deaths asso...","For example, the proportion of all deaths asso...",The 2006-07 flu season was generally mild comp...,639
1,2006-2007,How many people died from flu during the 2006-...,Exact numbers of how many people died from flu...,Estimates of flu-associated deaths are made by...,Estimates of flu-associated deaths are made by...,This system collects information each week on ...,Flu-associated deaths are only a nationally no...,1168
2,2006-2007,Was there a good match between the influenza s...,The influenza A (H1) component of the 2006-07 ...,"Overall for the 2006-07 season, 24 percent of ...","In the early months of the season, the majorit...",Fifty percent of the influenza B viruses chara...,The influenza A (H1) component of the 2006-07 ...,1026
3,2006-2007,What can be done to protect children from flu-...,Vaccination remains the best method for preven...,Children with asthma or other conditions shoul...,Household contacts and caregivers of these chi...,Household contacts and caregivers of these chi...,Vaccination remains the best method for preven...,1542
4,2007-2008,What flu viruses circulated this season?,"In the United States, influenza A (H1N1), A (H...",Flu A viruses are subtyped in public health la...,Influenza A viruses accounted for 71% of the s...,Influenza A viruses accounted for 71% of the s...,Influenza A viruses accounted for 71% of the ...,1024
...,...,...,...,...,...,...,...,...
65,2021-2022,What vaccine uptake estimates did CDC provide ...,CDC’s Weekly Flu Vaccination Dashboard provide...,Additional information about NIS-Flu methods a...,CDC’s Weekly Flu Vaccination Dashboard provide...,The dashboard included information on the numb...,CDC’s Weekly Flu Vaccination Dashboard provide...,3274
66,2021-2022,Were there any updates in the methods for flu ...,"During the 2021-2022 flu season, there were a ...",More information on flu surveillance methodolo...,Hospitals in all 50 states and U.S. territorie...,CDC also added a surveillance system that trac...,"CDC added another surveillance system, the HHS...",2665
67,2023-2024,Updates to U.S. Flu Surveillance Methods for t...,"Starting with the 2023-2024 influenza season, ...",Flu vaccination is often available at no or lo...,Flu vaccination is often available at no or lo...,Although monitoring influenza-only coded death...,"Starting with the 2023-2024 influenza season, ...",1118
68,2023-2024,B/Yamagata and Flu Vaccines Summary,Quadrivalent flu vaccines protect against four...,CDC is not involved in regulatory decision-mak...,Quadrivalent flu vaccines protect against four...,Quadrivalent flu vaccines protect against four...,All current flu vaccines in the United States ...,2397
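
`apply(len)` counts characters and raises a `TypeError` if a cell is missing. As a sketch of an alternative, the vectorized `.str.len()` returns NaN for missing cells, and a word count can be derived in the same style. The `toy` frame and the `Word_Count` column are illustrative additions, not part of the saved dataset.

```python
import pandas as pd

# Illustrative frame with one missing cell
toy = pd.DataFrame({"Content": ["Flu season.", "Short", None]})

# Character count; missing cells become NaN instead of raising an error
toy["Content_Length"] = toy["Content"].str.len()

# Whitespace-separated word count (hypothetical extra column)
toy["Word_Count"] = toy["Content"].str.split().str.len()

print(toy)
```

Word counts are often a closer proxy than character counts when comparing text length against the input limits of summarization models, although those limits are defined in model-specific tokens.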


In [12]:
df_qa.to_csv('data/flu_data_summary_qa_length.csv', index=False)