$\textbf{Text Vectorization}$
-

- A vector is a geometric object that has a magnitude and a direction.

- Text vectorization is the projection of words into a mathematical space while preserving information.

$\textbf{The Bag of Words Model}$
-

- The bag-of-words (BOW) model is a straightforward model for vectorizing sentences.

- BOW uses word frequencies to construct vectors.

- The BOW model is an orderless document representation: only the counts of the words matter.

- Because BOW does not take the position of words into account, we lose semantic information.

- To vectorize several sentences, tokenize each one and join the resulting tokens into a single vocabulary.

- The vocabulary acts as a reference for checking whether a specific word is present or absent in each sentence.

$EXAMPLE$

In [39]:
import re
import string

s1 = "dog sat mat."
s2 = "cat love dog."

def token_sentence(s):
    # Make a regular expression that matches all punctuation
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    # Use the regex
    res = regex.sub('', s)
    res = res.split()
    return res

new_s1 = token_sentence(s1)
new_s2 = token_sentence(s2)
vocabulary = list(set(new_s1 + new_s2))
vocabulary

['dog', 'love', 'mat', 'cat', 'sat']

In [40]:
new_s1

['dog', 'sat', 'mat']

In [41]:
BOW = [int(u in new_s1) for u in vocabulary]
BOW

[1, 0, 1, 0, 1]
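
The vector above records only presence or absence, and the order of a `set` can change between runs. A minimal sketch of a count-based variant with a sorted vocabulary follows, assuming the same two tokenized sentences; the helper name `bow_counts` and the variable names are illustrative and not part of the original example.

```python
from collections import Counter

def bow_counts(tokens, vocabulary):
    """Return how often each vocabulary word occurs in tokens."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

tokens_1 = ["dog", "sat", "mat"]
tokens_2 = ["cat", "love", "dog"]

# Sorting makes the vocabulary order reproducible across runs.
sorted_vocabulary = sorted(set(tokens_1 + tokens_2))

bow_1 = bow_counts(tokens_1, sorted_vocabulary)
bow_2 = bow_counts(tokens_2, sorted_vocabulary)
print(sorted_vocabulary)
print(bow_1)
print(bow_2)
```

With a fixed vocabulary order, vectors from different sentences can be compared position by position.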

$\textbf{Term Frequency-Inverse Document Frequency (TF-IDF)}$
-

- TF-IDF is a model widely used in search engines to retrieve relevant documents.

- Two pieces of information are encoded: the term frequency and the inverse document frequency.

- The term frequency is the number of times a term appears in a document, relative to the total number of terms in that document.

- The inverse document frequency measures how informative a term is across the whole collection of documents.

- The inverse document frequency is calculated by logarithmically scaling the inverse fraction of the documents containing the word. This is obtained by dividing the total number of documents by the number of documents containing the term, followed by taking the logarithm of the ratio.

- The inverse document frequency measures how common or rare a term is among all documents.

The formulas are:
\begin{gather}
TF(t) = \frac{\text{number of times the term "t" appears in a specific document}}{\text{total number of terms in the document}}
\end{gather}

\begin{gather}
IDF(t) = \log\left(\frac{\text{total number of documents}}{\text{number of documents with term "t"}}\right)
\end{gather}

\begin{gather}
\text{TF-IDF}(t) = TF(t) \cdot IDF(t)
\end{gather}

- TF-IDF carries more information than the plain BOW representation: instead of using raw word counts as BOW does, it makes rare terms more prominent and down-weights common words, including stopwords such as "is", "that", "of", etc.
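
The formulas can be checked by hand on a toy corpus. The sketch below is a direct transcription of the three formulas, assuming the natural logarithm (the formulas do not fix a base); the function names and `toy_docs` are illustrative. Library implementations may use a different log base and normalize the resulting vectors, so their numbers will generally differ from these.

```python
import math

def tf(term, doc_tokens):
    """Term frequency: occurrences of term divided by document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, all_docs):
    """Inverse document frequency; assumes term occurs in at least one document."""
    n_containing = sum(1 for doc in all_docs if term in doc)
    return math.log(len(all_docs) / n_containing)

def tf_idf(term, doc_tokens, all_docs):
    return tf(term, doc_tokens) * idf(term, all_docs)

toy_docs = [
    ["dog", "sat", "mat"],
    ["cat", "love", "dog"],
]

# "dog" occurs in both documents, so its IDF is log(2/2) = 0.
score_dog = tf_idf("dog", toy_docs[0], toy_docs)
# "mat" occurs in one of two documents: (1/3) * log(2/1).
score_mat = tf_idf("mat", toy_docs[0], toy_docs)
print(score_dog, score_mat)
```

A term that appears in every document scores zero, which is how very common words end up carrying no weight.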

$\textbf{Vectorization Using Gensim}$

In [42]:
from gensim import corpora
import spacy
# from pypdf import PdfReader 
nlp = spacy.load('en_core_web_sm')

# article1 = PdfReader("references/PCOS_BeyondTheBasics.pdf");
# article2 = PdfReader("references/PCOS_CharacterizationThroughFloApp.pdf");
# article3 = PdfReader("references/PCOS_Diabetes.pdf");
# article4 = PdfReader("references/PCOS_HopkinsMedicine.pdf");
# article5 = PdfReader("references/PCOS_ReviewOfTreatmentOptions.pdf");
# article6 = PdfReader("references/PCOS_WhatIsPCOS.pdf");

text1 = "The changes in hormone levels described above cause the classic symptoms of PCOS, including absent or irregular and infrequent menstrual periods, increased body hair growth or scalp hair loss, acne, and difficulty becoming pregnant. Signs and symptoms of PCOS usually begin around the time of puberty, although some females do not develop symptoms until late adolescence or even into early adulthood. Because hormonal changes vary from one female to another, patients with PCOS may have mild to severe acne, facial hair growth, or scalp hair loss. Menstrual irregularity — If ovulation does not occur, the ovaries do not produce progesterone, and the lining of the uterus (called the endometrium) becomes thicker and may shed irregularly, which can result in heavy and/or prolonged bleeding. Irregular or absent menstrual periods can increase a female's risk of endometrial overgrowth (called endometrial hyperplasia) or even endometrial cancer. Females with PCOS usually have fewer than six to eight menstrual periods per year. Weight gain and obesity — PCOS is associated with gradual weight gain and obesity in approximately one-half of females. For some females with PCOS, obesity develops at the time of puberty. Hair growth and acne — Male-pattern hair growth (hirsutism) may be seen on the upper lip, chin, neck, sideburn area, chest, upper or lower abdomen, upper arm, and inner thigh. Acne is a skin condition that causes oily skin and blockages in hair follicles. Infertility — Many females with PCOS do not ovulate regularly, and it may take these females longer to become pregnant. For females with PCOS who desire pregnancy but have irregular periods, the fertility evaluation should start immediately as the chance of becoming pregnant is low without treatment. (See 'Treatment of infertility' below.) 
Heart disease — Females who are obese and who also have insulin resistance or diabetes might have an increased risk of coronary artery disease, which increases the risk of having a heart attack. It is not known with certainty if females with PCOS are at increased risk for this condition. Both weight loss and treatment of insulin abnormalities can decrease this risk. Other treatments (eg, cholesterol-lowering medications [statins], and treatments for high blood pressure) may also be recommended. Sleep apnea — Sleep apnea is a condition that causes brief spells where breathing stops (apnea) during sleep. Patients with this problem often experience fatigue and daytime sleepiness. In addition, there is evidence that people with untreated sleep apnea have an increased risk of insulin resistance, obesity, diabetes, and cardiovascular problems, such as high blood pressure, heart attack, abnormal heart rhythms, or stroke. Sleep apnea may occur in up to 50 percent of females with PCOS. The condition can be diagnosed with a sleep study, and several treatments are available. Other problems — Females with PCOS are at increased risk of other problems that can impact quality of life. These include: Depression and anxiety – There are treatments that can help with these problems, including therapy as well as medications. Sexual dysfunction – Females with PCOS are more likely than other females to experience lower sexual satisfaction. Eating disorders – These include bulimia and binge eating. Females with PCOS do not appear to be at increased risk of developing anorexia If you think you might be experiencing any of these problems, talk with your health care provider. There are often treatments that can help. Symptoms after menopause — Less is known about PCOS symptoms after menopause. Research suggests that females with PCOS may continue to have high androgen levels after menopause (when monthly periods normally stop), but that they decline to normal after approximately age 70. 
However, even females who have been through menopause and whose hormone levels are returning to normal can have symptoms like excess hair growth."
text2 = "Using Flo’s technology, we analyzed the largest known PCOS symptom dataset to obtain a comprehensive understanding of the distribution of PCOS and its varying phenotypes worldwide. Among all countries, the highest ratio of PCOS positive to PCOS negative users occurred in Trinidad and Tobago, Philippines, United Arab Emirates, India, Jamaica, UK, followed by the US. The US, UK, India, Philippines and Australia had the greatest number of respondents to the PCOS dialog box. Within these top five countries, the most prevalent predictors of PCOS were bloating, both high cholesterol and glucose, and high glucose alone. Additionally, among four of the top five countries, bloating was the most frequently reported symptom. When examining BMI in relation to PCOS, there is a trend that as BMI increased, the percentage of women with a self-reported PCOS diagnosis also increased. However, women in India did not follow this trend as there was no significant relationship between BMI and PCOS status. Previous research on global PCOS symptomatology is both limited and inconsistent. Many have identified South Asian women to have among the lowest prevalence rates, yet this group has been found to have high rates of insulin resistance and metabolic syndrome. Another study found that 52% of women residing in India present with PCOS, which is the highest reported prevalence internationally. Consistent with this report, our findings show that India and the Philippines were among the top countries with high ratios of PCOS positive to PCOS negative users. Another study analyzing PCOS phenotypes among different populations reported that women with PCOS from Asia and America were at increased risk of type II diabetes. Their PCOS is most often characterized by insulin resistance, high BMI, or central obesity while Europeans and Middle Eastern women often experience androgenic alopecia, hirsutism, and hyperandrogenism. 
In our sample from India, individuals who reported both high cholesterol and high glucose were almost three times more likely to self-report having PCOS. However, respondents in the US and UK with symptoms of both high cholesterol and high glucose were almost four times more likely, and those in Australia almost five times more likely to report PCOS. Moreover, East Asian women with PCOS often have a milder hyperandrogenic phenotype and lower BMI compared to others, but have the highest prevalence of metabolic syndrome. Kumarapeli et al. studied a semi urban population in Sri Lanka and found that of women with self-reported oligo/amenorrhea or hirsutism, over 90% had a confirmed PCOS diagnosis. These women tend to have less hirsutism compared to women from Europe and the US. It is known that the prevalence of PCOS is increased in overweight and obese women, and that obesity prevalence has globally increased in the last few decades. Our results also reveal that as BMI increased, the proportion of women with a PCOS diagnosis also increased. However, obesity prevalence is highly variable by age, ethnicity, and geographical location. In the US and UK, obese women were twice as likely to have PCOS compared to those of normal weight; while there were no observed trends in BMI and odds of PCOS diagnosis in India. Geographic differences in the prevalence of obesity is likely a result of the interaction between individual factors (e.g. genetic) and environmental factors (e.g. food supply). A previous meta-analysis indicated that an increased risk of obesity exists for Caucasian women from the US and Europe compared to Asian women from China and Taiwan, suggesting a difference in the nature of PCOS based on location. 
Understanding such geographical differences in PCOS as it relates to BMI is critical for countries where in creased obesity exists, as overweight and obese PCOS patients are more likely to exhibit clinical signs of androgen excess, significantly more severe insulin resistance, as well as anxiety and depression. Compared to the NIH diagnostic criteria, the more expansive definition and inclusion of additional phenotypes of the Rotterdam and AES criteria may explain the greater estimates of PCOS prevalence. When using the same defining criteria, variations in the reported prevalence across countries can in part be explained by ethnic differences, by the approaches used to define study population(s), and the application of varying methods to evaluate key PCOS features. In surveying the largest known sample on identified PCOS symptoms, we are able to provide evidence that the symptomatology may be more complex than previously understood. Within the top five countries, our most frequently reported symptoms were bloating, facial hirsutism, irregular cycles, hyperpigmentation, and baldness. The symptoms reported in our sample are broader than those included in the Rotterdam criteria, suggesting more work and further research is needed to reevaluate and refine PCOS diagnostic criteria. Also, the most frequently reported symptoms of PCOS varied across countries, suggesting the presence of environmental and/or genetic effects on the PCOS phenotype. We find it unique and interesting that although women in India had the most frequently reported diagnosis of PCOS (22.7%), these women exhibited a different phenotype relative to women from the other countries. Women in India with PCOS were significantly less likely to experience bloating relative to the other countries examined. Furthermore, women in India with PCOS uniquely did not exhibit a significant relationship between BMI and PCOS status. 
Possible reasons for this unique phenotype may include differences in genetics, diet, and environmental exposures. Further research is needed to confirm and better understand these differences. Interestingly, we found that symptoms were similar between US/UK and between India/Philippines - countries that are socio-demographically similar. Gynecological and reproductive education delivered through apps has potential to improve physician to patient interactions, while also providing large quantities of menstrual cycle and related data. There are over a hundred female health and wellbeing apps with more than 200 million downloads. As such, medical professionals and researchers can gather information from large, unselected patient populations like ours in order to improve the understanding of gynecological disorders such as PCOS. Flo and other fertility apps can also provide public health benefits by offering standardized health promotion messages during various stages of reproductive life. Strengths of our study included a very large global sample of medically unbiased women. A limitation is that women who already have certain medical conditions may have been more likely to participate in the dialog. In addition to the fact data were self-reported, women who said they did not a have physician-confirmed PCOS diagnosis may have another reproductive disorder which might be symptomatically similar to PCOS. It is also possible that different countries use different diagnostic criteria and medical professionals may have different approaches to PCOS diagnosis. Lastly, the dialog was available to Flo users running the app in English, which limited representation especially in countries that are not predominantly English speaking."
text3 = "Irregular periods or no periods, caused from lack of ovulation. Higher than normal levels of male hormones that may result in excess hair on the face and body, acne, or thinning scalp hair. Multiple small cysts on the ovaries"
text4 = "Missed periods, irregular periods, or very light periods Ovaries that are large or have many cysts Excess body hair, including the chest, stomach, and back (hirsutism) Weight gain, especially around the belly (abdomen) Acne or oily skin Male-pattern baldness or thinning hair Infertility Small pieces of excess skin on the neck or armpits (skin tags)"
text5 = "Enlarged ovaries with numerous small cysts Irregular menstrual cycles Pelvic pain Hirsutism Alopecia Acne Acanthosis nigricans Skin tags"
text6 = "Irregular periods. A lack of ovulation prevents the uterine lining from shedding every month. Some women with PCOS get fewer than eight periods a year or none at all. Heavy bleeding. The uterine lining builds up for a longer period of time, so the periods you do get can be heavier than normal. Hair growth. More than 70 percent of women with this condition grow hair on their face and body — including on their back, belly, and chest. Excess hair growth is called hirsutism. Acne. Male hormones can make the skin oilier than usual and cause breakouts on areas like the face, chest, and upper back. Weight gain. Up to 80 percent of women with PCOS are overweight or have obesity. Male pattern baldness. Hair on the scalp gets thinner and may fall out. Darkening of the skin. Dark patches of skin can form in body creases like those on the neck, in the groin, and under the breasts. Headaches. Hormone changes can trigger headaches in some women."

documents = [text1, text2, text3, text4, text5, text6]

In [43]:
texts = []
for document in documents:
    text = []
    doc = nlp(document)
    for w in doc:
        if not w.is_stop and not w.is_punct and not w.like_num:
            text.append(w.lemma_)
    texts.append(text)
# texts is a mini-corpus focused on PCOS and its symptoms
print(texts)

[['change', 'hormone', 'level', 'describe', 'cause', 'classic', 'symptom', 'PCOS', 'include', 'absent', 'irregular', 'infrequent', 'menstrual', 'period', 'increase', 'body', 'hair', 'growth', 'scalp', 'hair', 'loss', 'acne', 'difficulty', 'pregnant', 'sign', 'symptom', 'PCOS', 'usually', 'begin', 'time', 'puberty', 'female', 'develop', 'symptom', 'late', 'adolescence', 'early', 'adulthood', 'hormonal', 'change', 'vary', 'female', 'patient', 'PCOS', 'mild', 'severe', 'acne', 'facial', 'hair', 'growth', 'scalp', 'hair', 'loss', 'menstrual', 'irregularity', 'ovulation', 'occur', 'ovary', 'produce', 'progesterone', 'lining', 'uterus', 'call', 'endometrium', 'thick', 'shed', 'irregularly', 'result', 'heavy', 'and/or', 'prolong', 'bleeding', 'irregular', 'absent', 'menstrual', 'period', 'increase', 'female', 'risk', 'endometrial', 'overgrowth', 'call', 'endometrial', 'hyperplasia', 'endometrial', 'cancer', 'female', 'PCOS', 'usually', 'few', 'menstrual', 'period', 'year', 'weight', 'gain', '

In [44]:
# Build the token-to-ID dictionary for the mini-corpus
dictionary = corpora.Dictionary(texts)
print(dictionary.token2id)

{'Acne': 0, 'Females': 1, 'PCOS': 2, 'abdoman': 3, 'abnormal': 4, 'abnormality': 5, 'absent': 6, 'acne': 7, 'addition': 8, 'adolescence': 9, 'adulthood': 10, 'age': 11, 'and/or': 12, 'androgen': 13, 'anorexia': 14, 'anxiety': 15, 'apnea': 16, 'appear': 17, 'approximately': 18, 'area': 19, 'arm': 20, 'artery': 21, 'associate': 22, 'attack': 23, 'available': 24, 'begin': 25, 'binge': 26, 'bleeding': 27, 'blockage': 28, 'blood': 29, 'body': 30, 'breathe': 31, 'brief': 32, 'bulimia': 33, 'call': 34, 'cancer': 35, 'cardiovascular': 36, 'care': 37, 'cause': 38, 'certainty': 39, 'chance': 40, 'change': 41, 'chest': 42, 'chin': 43, 'cholesterol': 44, 'classic': 45, 'condition': 46, 'continue': 47, 'coronary': 48, 'daytime': 49, 'decline': 50, 'decrease': 51, 'depression': 52, 'describe': 53, 'desire': 54, 'develop': 55, 'diabete': 56, 'diabetes': 57, 'diagnose': 58, 'difficulty': 59, 'disease': 60, 'disorder': 61, 'dysfunction': 62, 'early': 63, 'eat': 64, 'eating': 65, 'eg': 66, 'endometrial'

$INSIGHTS$

- There are 472 unique tokens in our corpus, which focuses on PCOS and its symptoms.

- Each word is indexed with an integer.

- This index is called a "word ID".

- The dictionary can now be used to map words to integer IDs.

We now use the doc2bow method, which, as the name suggests, converts a document to its bag-of-words representation.

In [45]:
corpus = [dictionary.doc2bow(text) for text in texts]
corpus

[[(0, 1),
  (1, 1),
  (2, 15),
  (3, 1),
  (4, 1),
  (5, 1),
  (6, 2),
  (7, 3),
  (8, 1),
  (9, 1),
  (10, 1),
  (11, 1),
  (12, 1),
  (13, 1),
  (14, 1),
  (15, 1),
  (16, 5),
  (17, 1),
  (18, 2),
  (19, 1),
  (20, 1),
  (21, 1),
  (22, 1),
  (23, 2),
  (24, 1),
  (25, 1),
  (26, 1),
  (27, 1),
  (28, 1),
  (29, 2),
  (30, 1),
  (31, 1),
  (32, 1),
  (33, 1),
  (34, 2),
  (35, 1),
  (36, 1),
  (37, 1),
  (38, 3),
  (39, 1),
  (40, 1),
  (41, 2),
  (42, 1),
  (43, 1),
  (44, 1),
  (45, 1),
  (46, 4),
  (47, 1),
  (48, 1),
  (49, 1),
  (50, 1),
  (51, 1),
  (52, 1),
  (53, 1),
  (54, 1),
  (55, 3),
  (56, 1),
  (57, 1),
  (58, 1),
  (59, 1),
  (60, 2),
  (61, 1),
  (62, 1),
  (63, 1),
  (64, 1),
  (65, 1),
  (66, 1),
  (67, 3),
  (68, 1),
  (69, 1),
  (70, 1),
  (71, 1),
  (72, 3),
  (73, 1),
  (74, 1),
  (75, 17),
  (76, 1),
  (77, 1),
  (78, 1),
  (79, 2),
  (80, 1),
  (81, 5),
  (82, 8),
  (83, 1),
  (84, 1),
  (85, 1),
  (86, 4),
  (87, 1),
  (88, 2),
  (89, 3),
  (90, 1),
  (91, 

- The output is a nested list.

- Each sublist is one document's bag-of-words representation.

- A reminder: you might see different numbers in your list, because the word-to-ID mapping may differ each time a dictionary is created.

- Unlike the earlier example, where the absence of a word was marked with a 0, here each entry is a tuple (word_id, word_count) and absent words are simply omitted.

- We can easily verify this by checking the original sentence, mapping each word to its integer ID and reconstructing our list.

- In small corpora most words occur only once per document, as the many counts of 1 in the output above show. A few frequent terms are exceptions, for example PCOS (word ID 2), which occurs 15 times in the first document.
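
To make the (word_id, word_count) format concrete, here is a minimal pure-Python stand-in for the dictionary and doc2bow steps, including the reverse lookup used to verify a vector. It is a sketch of the idea, not Gensim's implementation; `build_token2id`, `to_bow`, and the toy data are hypothetical names introduced for illustration.

```python
from collections import Counter

def build_token2id(token_lists):
    """Assign an integer ID to every distinct token (sorted for reproducibility)."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: idx for idx, tok in enumerate(vocab)}

def to_bow(tokens, token2id):
    """Return sorted (word_id, word_count) pairs; absent words are omitted."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())

toy_texts = [["hair", "growth", "hair", "loss"], ["acne", "hair"]]
toy_token2id = build_token2id(toy_texts)
toy_corpus = [to_bow(tokens, toy_token2id) for tokens in toy_texts]

# The reverse mapping lets us check a sparse vector against the original tokens.
id2token = {idx: tok for tok, idx in toy_token2id.items()}
readable = [(id2token[i], n) for i, n in toy_corpus[0]]
print(toy_corpus)
print(readable)
```

Translating IDs back to tokens in this way is a quick sanity check that the counts match the original document.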

In [46]:
#storing your generated corpus

corpora.MmCorpus.serialize('1_PCOS_Corpus.mm', corpus)

- It is more memory-efficient to store the corpus on disk and load it later, because at most one vector resides in RAM at a time.

In [47]:
# Convert the bag-of-words corpus to a TF-IDF representation
from gensim import models
tfidf = models.TfidfModel(corpus)

# Each document becomes a list of (word_id, tfidf_score) pairs
for document in tfidf[corpus]:
    print(document)

[(0, 0.0152758784843908), (1, 0.039487573047723344), (2, 0.229138177265862), (3, 0.039487573047723344), (4, 0.039487573047723344), (5, 0.039487573047723344), (6, 0.07897514609544669), (7, 0.0458276354531724), (8, 0.02421169456333255), (9, 0.039487573047723344), (10, 0.039487573047723344), (11, 0.02421169456333255), (12, 0.02421169456333255), (13, 0.02421169456333255), (14, 0.039487573047723344), (15, 0.02421169456333255), (16, 0.19743786523861673), (17, 0.039487573047723344), (18, 0.07897514609544669), (19, 0.02421169456333255), (20, 0.039487573047723344), (21, 0.039487573047723344), (22, 0.039487573047723344), (23, 0.07897514609544669), (24, 0.02421169456333255), (25, 0.039487573047723344), (26, 0.039487573047723344), (27, 0.02421169456333255), (28, 0.039487573047723344), (29, 0.07897514609544669), (30, 0.008935816078941746), (31, 0.039487573047723344), (32, 0.039487573047723344), (33, 0.039487573047723344), (34, 0.0484233891266651), (35, 0.039487573047723344), (36, 0.0394875730477233

- TF-IDF scores: The higher the score, the more important the word in the document.

$\textbf{N-Gramming}$
-

- Context is very important when working with text data.
- This context is lost in a bag-of-words vector representation because only the word frequency is taken into account.
- An n-gram is a contiguous sequence of n items in the text. In our case the items are words, but depending on the use case they could also be letters, syllables, or, in the case of speech, phonemes.
- Mono-gram: n = 1
- Bi-gram: n = 2
- Tri-gram: n = 3
- N-grams can be scored by the conditional probability of a token given the preceding token.
- N-grams can also be found by counting words that appear close to each other.
- Bi-gramming is also called collocation: it locates pairs of words that are very likely to appear together.
- Example: "New Hampshire" should be treated as one token, not as "New" and "Hampshire".
- Gensim approaches bigrams by simply combining two tokens that are highly likely to occur together with an underscore. The tokens new and york will now become new_york. Similar to the TF-IDF model, bigrams can be created using another Gensim model: Phrases.
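
The conditional-probability idea can be sketched in a few lines of plain Python. This is an illustration of the concept only and is not the scoring that Gensim's Phrases model uses; the function name, the toy sentence, and the `MIN_COUNT` and `THRESHOLD` values are assumptions chosen for the example.

```python
from collections import Counter

def bigram_stats(tokens):
    """Count adjacent pairs and estimate P(second | first) for each pair."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Only tokens that have a successor can start a pair.
    first_counts = Counter(tokens[:-1])
    probs = {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}
    return pair_counts, probs

toy_tokens = ["new", "york", "is", "big", "new", "york",
              "has", "parks", "new", "york"]
pair_counts, probs = bigram_stats(toy_tokens)

MIN_COUNT = 2    # hypothetical: ignore pairs seen fewer than twice
THRESHOLD = 0.9  # hypothetical: minimum conditional probability

# Join frequent, high-probability pairs with an underscore.
joined = sorted(
    "_".join(pair) for pair in probs
    if pair_counts[pair] >= MIN_COUNT and probs[pair] >= THRESHOLD
)
print(joined)
```

The minimum count matters: in a tiny corpus, a pair seen once trivially has a conditional probability of 1, so probability alone would merge far too many pairs.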

In [48]:
import gensim
bigram = gensim.models.Phrases(texts)
texts = [bigram[line] for line in texts]
texts

[['change',
  'hormone',
  'level',
  'describe',
  'cause',
  'classic',
  'symptom',
  'PCOS',
  'include',
  'absent',
  'irregular',
  'infrequent',
  'menstrual',
  'period',
  'increase',
  'body',
  'hair_growth',
  'scalp',
  'hair',
  'loss',
  'acne',
  'difficulty',
  'pregnant',
  'sign',
  'symptom',
  'PCOS',
  'usually',
  'begin',
  'time',
  'puberty',
  'female',
  'develop',
  'symptom',
  'late',
  'adolescence',
  'early',
  'adulthood',
  'hormonal',
  'change',
  'vary',
  'female',
  'patient',
  'PCOS',
  'mild',
  'severe',
  'acne',
  'facial',
  'hair_growth',
  'scalp',
  'hair',
  'loss',
  'menstrual',
  'irregularity',
  'ovulation',
  'occur',
  'ovary',
  'produce',
  'progesterone',
  'lining',
  'uterus',
  'call',
  'endometrium',
  'thick',
  'shed',
  'irregularly',
  'result',
  'heavy',
  'and/or',
  'prolong',
  'bleeding',
  'irregular',
  'absent',
  'menstrual',
  'period',
  'increase',
  'female',
  'risk',
  'endometrial',
  'overgrowth',

$\textbf{NOTE}$: Creating phrases adds new tokens (for example hair_growth) to our texts, so this step must be done before the dictionary is created. Because our dictionary was built earlier, we rebuild it and the corpus:

In [49]:
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

After creating our bi-grams, we can create tri-grams and other n-grams by running the Phrases model multiple times on our corpus. Bi-grams remain the most used n-gram model, though it is worth one's time to glance over the other uses and kinds of n-gram implementations.

In [50]:
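# Remove both high-frequency and low-frequency words.
# Example: drop words that occur in fewer than 20 documents or in more than 50% of the documents.
# Note: with only six documents in this mini-corpus, no_below=20 removes every token,
# so lower the threshold (e.g. no_below=2) to match the corpus size.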
dictionary.filter_extremes(no_below=20, no_above=0.5)
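
The effect of the two thresholds can be illustrated with a small pure-Python sketch. It mirrors only the document-frequency logic of `no_below` and `no_above` and is not Gensim's implementation; the function name and the toy documents are hypothetical.

```python
def filter_by_document_frequency(token_lists, no_below, no_above):
    """Keep tokens found in at least no_below documents and in at most
    a fraction no_above of all documents."""
    n_docs = len(token_lists)
    doc_freq = {}
    for tokens in token_lists:
        for tok in set(tokens):  # count each token once per document
            doc_freq[tok] = doc_freq.get(tok, 0) + 1
    return sorted(
        tok for tok, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above
    )

symptom_docs = [
    ["pcos", "acne", "hair"],
    ["pcos", "hair", "weight"],
    ["pcos", "acne", "cyst"],
    ["pcos", "ovary"],
]

kept = filter_by_document_frequency(symptom_docs, no_below=2, no_above=0.5)
# A threshold larger than the corpus size leaves nothing behind.
kept_strict = filter_by_document_frequency(symptom_docs, no_below=20, no_above=0.5)
print(kept)
print(kept_strict)
```

Here "pcos" is dropped because it appears in every document, the one-off tokens are dropped as too rare, and the strict setting returns an empty list.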

$\textbf{Programming Assignment}$

Choose a topic that you will use for your term paper in this subject. Collect articles, publications, stories, etc. on your chosen topic and develop your own mini-corpus using the required preprocessing steps. Be sure to print the output.

Note that this corpus will be used for the entire subject.