## Text Cleaning

In [1]:
!pip install pandas



In [3]:
!pip install spacy

In [4]:
import pandas as pd
import re
import string
import spacy
import unicodedata

### Import file

In [5]:
content = pd.read_csv('flu_data.csv')

### Preview data

In [6]:
content

Unnamed: 0,Year,Content
0,2018-2019,What’s new this flu season? A few things are n...
1,2019-2020,What’s new this flu season? A few things are n...
2,2020-2021,2020-21 Flu Season Summary FAQ Summary What wa...
3,2021-2022,Summary What was the 2021-2022 flu season like...
4,2022-2023,What’s New for 2022-2023 A few things are diff...


In [7]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download NLTK resources (if not already downloaded)
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\CPE\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\CPE\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [8]:
text = content['Content'].values[0]

## Text cleaning

### Clean punctuation

In [9]:
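def clean_punctuation(text):
    # Remove every character that is not a letter, digit, or whitespace.
    # The underscore is listed separately because \w also matches it.
    punctuation_pattern = re.compile(r'[^\w\s]|_')
    cleaned_text = punctuation_pattern.sub('', text)

    return cleaned_text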

In [10]:
cleaned_text = clean_punctuation(text)
print(text)
print(cleaned_text)

What’s new this flu season? A few things are new this season: Flu vaccines have been updated to better match circulating viruses [the B/Victoria component was changed and the influenza A(H3N2) component was updated]. For the 2018-2019 season, the nasal spray flu vaccine (live attenuated influenza vaccine or “LAIV”) is again a recommended option for influenza vaccination of persons for whom it is otherwise appropriate. The nasal spray is approved for use in non-pregnant individuals, 2 to 49 years old. There is a precaution against the use of LAIV for people with certain underlying medical conditions. All LAIV will be quadrivalent (four-component). Most regular-dose egg-based flu shots will be quadrivalent. All recombinant vaccine will be quadrivalent. (No trivalent recombinant vaccine will be available this season.) Cell-grown flu vaccine will be quadrivalent. For this vaccine, the influenza A(H3N2) and both influenza B reference viruses will be cell-derived, and the influenza A(H1N1) w
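
Because the pattern deletes punctuation instead of replacing it, hyphenated terms and year ranges are fused ("non-pregnant" becomes "nonpregnant", "2018-2019" becomes "20182019"), as the output of the next cell shows. The sketch below contrasts that behavior with a hypothetical variant, `space_punctuation`, which is not part of the notebook and substitutes a space instead. Whether fused or split tokens are preferable depends on the downstream analysis.

```python
import re

PUNCT = re.compile(r'[^\w\s]|_')

def strip_punctuation(text: str) -> str:
    """Delete punctuation outright, as clean_punctuation does."""
    return PUNCT.sub('', text)

def space_punctuation(text: str) -> str:
    """Hypothetical variant: replace punctuation with a space,
    then collapse the resulting runs of whitespace."""
    return re.sub(r'\s+', ' ', PUNCT.sub(' ', text)).strip()

sample = "non-pregnant individuals, 2 to 49 years old (2018-2019)."
print(strip_punctuation(sample))  # nonpregnant individuals 2 to 49 years old 20182019
print(space_punctuation(sample))  # non pregnant individuals 2 to 49 years old 2018 2019
```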

### Lower case

In [11]:
lower_text = cleaned_text.lower()
print(cleaned_text)
print(lower_text)

Whats new this flu season A few things are new this season Flu vaccines have been updated to better match circulating viruses the BVictoria component was changed and the influenza AH3N2 component was updated For the 20182019 season the nasal spray flu vaccine live attenuated influenza vaccine or LAIV is again a recommended option for influenza vaccination of persons for whom it is otherwise appropriate The nasal spray is approved for use in nonpregnant individuals 2 to 49 years old There is a precaution against the use of LAIV for people with certain underlying medical conditions All LAIV will be quadrivalent fourcomponent Most regulardose eggbased flu shots will be quadrivalent All recombinant vaccine will be quadrivalent No trivalent recombinant vaccine will be available this season Cellgrown flu vaccine will be quadrivalent For this vaccine the influenza AH3N2 and both influenza B reference viruses will be cellderived and the influenza AH1N1 will be eggderived All these reference vi

### Character Normalization

In [79]:
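def normalize_characters(text):
    # NFKD splits accented characters into a base letter plus combining marks;
    # encoding to ASCII with 'ignore' then drops everything that is not ASCII.
    normalized_text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    return normalized_text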

In [80]:
normalized_text = normalize_characters(lower_text)
print(lower_text)
print(normalized_text)

whats new this flu season a few things are new this season flu vaccines have been updated to better match circulating viruses the bvictoria component was changed and the influenza ah3n2 component was updated for the 20182019 season the nasal spray flu vaccine live attenuated influenza vaccine or laiv is again a recommended option for influenza vaccination of persons for whom it is otherwise appropriate the nasal spray is approved for use in nonpregnant individuals 2 to 49 years old there is a precaution against the use of laiv for people with certain underlying medical conditions all laiv will be quadrivalent fourcomponent most regulardose eggbased flu shots will be quadrivalent all recombinant vaccine will be quadrivalent no trivalent recombinant vaccine will be available this season cellgrown flu vaccine will be quadrivalent for this vaccine the influenza ah3n2 and both influenza b reference viruses will be cellderived and the influenza ah1n1 will be eggderived all these reference vi
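
The flu text is already almost pure ASCII at this point, so the two printed strings look identical. A minimal sketch on a made-up sample makes the effect of the NFKD-plus-ASCII step visible. The helper name `to_ascii` and the sample strings are illustrative only and are not taken from the notebook.

```python
import unicodedata

def to_ascii(text: str) -> str:
    # Decompose characters (e.g. "é" -> "e" + combining accent), then
    # drop every code point that has no ASCII representation.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(to_ascii("naïve café"))      # naive cafe
print(to_ascii("What’s “LAIV”"))   # Whats LAIV
```

Typographic quotes have no ASCII decomposition, so they are removed rather than converted to straight quotes.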

## Text preprocessing

### Lemmatization

In [81]:
# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

In [82]:
doc = nlp(normalized_text)

lemmatized_words = [token.lemma_ for token in doc]
lemmatized_text = ' '.join(lemmatized_words)

In [83]:
print(cleaned_text)
print(lemmatized_text)

Whats new this flu season A few things are new this season Flu vaccines have been updated to better match circulating viruses the BVictoria component was changed and the influenza AH3N2 component was updated For the 20182019 season the nasal spray flu vaccine live attenuated influenza vaccine or LAIV is again a recommended option for influenza vaccination of persons for whom it is otherwise appropriate The nasal spray is approved for use in nonpregnant individuals 2 to 49 years old There is a precaution against the use of LAIV for people with certain underlying medical conditions All LAIV will be quadrivalent fourcomponent Most regulardose eggbased flu shots will be quadrivalent All recombinant vaccine will be quadrivalent No trivalent recombinant vaccine will be available this season Cellgrown flu vaccine will be quadrivalent For this vaccine the influenza AH3N2 and both influenza B reference viruses will be cellderived and the influenza AH1N1 will be eggderived All these reference vi

### Text cleaning after pre-processing

#### Tokenization

In [84]:
def tokenization(text):
    # Drop any remaining ASCII punctuation, then split on whitespace.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return tokens

In [85]:
text_tokens = tokenization(lemmatized_text)
print(lemmatized_text)
print(text_tokens)

what s new this flu season a few thing be new this season flu vaccine have be update to well match circulate virus the bvictoria component be change and the influenza ah3n2 component be update for the 20182019 season the nasal spray flu vaccine live attenuate influenza vaccine or laiv be again a recommend option for influenza vaccination of person for whom it be otherwise appropriate the nasal spray be approve for use in nonpregnant individual 2 to 49 year old there be a precaution against the use of laiv for people with certain underlying medical condition all laiv will be quadrivalent fourcomponent most regulardose eggbase flu shot will be quadrivalent all recombinant vaccine will be quadrivalent no trivalent recombinant vaccine will be available this season cellgrown flu vaccine will be quadrivalent for this vaccine the influenza ah3n2 and both influenza b reference virus will be cellderive and the influenza ah1n1 will be eggderive all these reference virus will be grow in cell to p
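
One detail of this tokenizer: `string.punctuation` contains only ASCII punctuation, so typographic quotes would pass through untouched. That is harmless here because `clean_punctuation` and `normalize_characters` have already run, but it matters if the function is reused on raw text. A small sketch on a made-up string:

```python
import string

def tokenization(text):
    # Same logic as the notebook cell: drop ASCII punctuation, split on whitespace.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()

tokens = tokenization("it's a flu-season “update”")
print(tokens)  # ['its', 'a', 'fluseason', '“update”']
```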

#### Stopwords removal

In [86]:
def stopwords_removal(text: str) -> list:
    # Tokenize with NLTK, then keep only words outside the English stop list.
    words = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return filtered_words

In [87]:
stopword_text = stopwords_removal(lemmatized_text)
print(lower_text)
print(stopword_text)

whats new this flu season a few things are new this season flu vaccines have been updated to better match circulating viruses the bvictoria component was changed and the influenza ah3n2 component was updated for the 20182019 season the nasal spray flu vaccine live attenuated influenza vaccine or laiv is again a recommended option for influenza vaccination of persons for whom it is otherwise appropriate the nasal spray is approved for use in nonpregnant individuals 2 to 49 years old there is a precaution against the use of laiv for people with certain underlying medical conditions all laiv will be quadrivalent fourcomponent most regulardose eggbased flu shots will be quadrivalent all recombinant vaccine will be quadrivalent no trivalent recombinant vaccine will be available this season cellgrown flu vaccine will be quadrivalent for this vaccine the influenza ah3n2 and both influenza b reference viruses will be cellderived and the influenza ah1n1 will be eggderived all these reference vi
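
`stopwords_removal` tokenizes its input again, so the token list built in the tokenization step is not reused. As a rough sketch, the filter can also be written to accept a token list directly. The function `remove_stopwords` and the tiny `STOP_WORDS` set below are hypothetical stand-ins for illustration, not NLTK's English list.

```python
# Tiny illustrative stop list; in the notebook this would be
# set(stopwords.words('english')) from NLTK.
STOP_WORDS = frozenset({"the", "be", "a", "this", "to", "have"})

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Filter an existing token list instead of re-tokenizing a string."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "the flu vaccine have be update to well match circulate virus".split()
filtered = remove_stopwords(tokens)
print(filtered)
# ['flu', 'vaccine', 'update', 'well', 'match', 'circulate', 'virus']
```

Building the stop set once and passing it in also avoids reconstructing it on every call.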

In [100]:
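def save_to_df(content, years):
    # `content` is the token list of one document. Each year in `years`
    # gets a row holding that full list (only one year is active below).
    data = {'Year': [], 'Content': []}
    for year in years:
        data['Year'].append(year)
        data['Content'].append(content)

    return pd.DataFrame(data)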

In [101]:
years = [
    "2018-2019",
    # "2019-2020",
    # "2020-2021",
    # "2021-2022",
    # "2022-2023"
]

In [102]:
df = save_to_df(stopword_text, years)

In [103]:
print(df)

        Year                                            Content
0  2018-2019  [new, flu, season, thing, new, season, flu, va...


In [93]:
df.to_csv('flu_data_token.csv', index=False)
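
The notebook processes only the first row of `flu_data.csv`. One way to extend the same steps to every row is sketched below. It makes several assumptions: `clean` and `preprocess_frame` are hypothetical helpers, the lemmatizer is passed in as an optional callable (for example a wrapper around the spaCy pipeline), the stop list is supplied by the caller, and the demo frame is stand-in data. Tokens are stored as a space-joined string because `to_csv` writes a Python list as its string representation, which is awkward to parse back.

```python
import re
import unicodedata
import pandas as pd

PUNCT = re.compile(r'[^\w\s]|_')

def clean(text: str) -> str:
    """Punctuation removal, lower-casing, and ASCII folding, in the notebook's order."""
    text = PUNCT.sub('', text).lower()
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

def preprocess_frame(frame, stop_words, lemmatize=None):
    """Return a frame with `Year` and a space-joined `Tokens` column.

    `lemmatize` is an optional callable str -> str supplied by the caller;
    `stop_words` is any set of lower-case words.
    """
    def process(text):
        text = clean(text)
        if lemmatize is not None:
            text = lemmatize(text)
        return ' '.join(t for t in text.split() if t not in stop_words)

    out = frame[['Year']].copy()
    out['Tokens'] = frame['Content'].map(process)
    return out

# Stand-in data and stop list, for illustration only.
demo = pd.DataFrame({
    'Year': ['2018-2019', '2019-2020'],
    'Content': ['What’s new this flu season?', 'Flu vaccines have been updated.'],
})
result = preprocess_frame(demo, stop_words={'this', 'have', 'been'})
print(result)
```

With the real data, the call would look something like `preprocess_frame(content, set(stopwords.words('english')), lemmatize=...)`, followed by `to_csv(..., index=False)` as in the last cell.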