### Regular Expressions

In [1]:
text = '''
Tesla's gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
BMW's gross cost of operating vehicles in fy2021 S1 was $8 billion.
'''

In [2]:
import re

In [3]:
# Raw string so that \d reaches the regex engine as-is (avoids invalid-escape warnings)
pattern=r'FY\d{4} [A-Z0-9]*'
x=re.findall(pattern,text,flags=re.IGNORECASE)
print(x)

['FY2021 Q1', 'fy2021 S1']


In [4]:
# A literal dollar sign followed by digits and decimal points
pattern=r'\$[\d\.]+'
x=re.findall(pattern,text)
print(x)

['$4.85', '$8']


In [5]:
# Two capture groups: the fiscal period and the dollar amount that follows it
pattern=r'(FY\d{4} [A-Z][0-9]) [^\$]* (\$[\d\.]*)'
x=re.findall(pattern,text,flags=re.IGNORECASE)
print(x)

[('FY2021 Q1', '$4.85'), ('fy2021 S1', '$8')]
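
The capture-group pattern above can also be written with named groups, which makes each match self-describing. This is only a sketch of that variant: the group names `period` and `amount` are labels chosen here for illustration and are not part of the original cells.

```python
import re

text = """
Tesla's gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
BMW's gross cost of operating vehicles in fy2021 S1 was $8 billion.
"""

# Same idea as the two-group pattern, but each group carries a (hypothetical) name
pattern = re.compile(
    r"(?P<period>FY\d{4} [A-Z][0-9]) [^$]* (?P<amount>\$[\d.]+)",
    re.IGNORECASE,
)

# finditer yields match objects; groupdict() maps group names to the matched text
records = [m.groupdict() for m in pattern.finditer(text)]
print(records)
```

Each element of `records` is a dictionary, so downstream code can read `record["amount"]` instead of relying on tuple positions.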


### Bag of words

- It is a method of extracting essential features from raw text so that the text can be used in machine learning models.
- It is called a “bag” of words because the order in which the words occur is discarded.
- It splits the raw text into words and counts how often each word occurs.
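
The idea in the points above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular library implements it: the helper name `bag_of_words` and the two toy documents are made up for this example, and the token rule (lowercase words of two or more characters) is an assumption chosen to resemble common defaults.

```python
import re
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count matrix) for a list of strings."""
    # Lowercase each document and keep word tokens of two or more characters
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    # The vocabulary is the sorted set of all tokens; word order is discarded
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        # One column per vocabulary word, holding its frequency in this document
        rows.append([counts[word] for word in vocab])
    return vocab, rows


docs = ["the cat sat", "the cat saw the dog"]  # toy documents for illustration
vocab, rows = bag_of_words(docs)
print(vocab)
print(rows)
```

Each row has one entry per vocabulary word, so documents of different lengths become vectors of the same size.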


#### Example:
##### Sentences:
Jim and Pam traveled by bus.

The train was late.

The flight was full. Traveling by flight is expensive.

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)
![image-3.png](attachment:image-3.png)
![image-4.png](attachment:image-4.png)

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "Jim and Pam traveled by bus.",
    "The train was late.",
    "The flight was full. Traveling by flight is expensive.",
]
cv = CountVectorizer()
# Learn the vocabulary and build the count matrix (one row per sentence)
b_o_w = cv.fit_transform(sentences).toarray()


In [12]:
print(cv.vocabulary_)

{'jim': 7, 'and': 0, 'pam': 9, 'traveled': 12, 'by': 2, 'bus': 1, 'the': 10, 'train': 11, 'was': 14, 'late': 8, 'flight': 4, 'full': 5, 'traveling': 13, 'is': 6, 'expensive': 3}


In [13]:
print(cv.get_feature_names_out())

['and' 'bus' 'by' 'expensive' 'flight' 'full' 'is' 'jim' 'late' 'pam'
 'the' 'train' 'traveled' 'traveling' 'was']


In [14]:
print(b_o_w)

[[1 1 1 0 0 0 0 1 0 1 0 0 1 0 0]
 [0 0 0 0 0 0 0 0 1 0 1 1 0 0 1]
 [0 0 1 1 2 1 1 0 0 0 1 0 0 1 1]]


### TF-IDF (Term Frequency-Inverse Document Frequency)

- A scoring measure generally used in information retrieval (IR) and summarization.
- It reflects the importance or relevance of a term in a given document.
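
As a hedged sketch of what the score computes, one common textbook formulation is given below. It is assumed here for illustration, and libraries such as scikit-learn use smoothed variants of it. With $N$ the number of documents and $\mathrm{df}(t)$ the number of documents that contain term $t$:

$$
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t), \qquad \mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}
$$

Here $\mathrm{tf}(t, d)$ is the number of times $t$ occurs in document $d$. A term that is frequent in one document but rare across the collection scores high, while a term that appears in every document gets $\mathrm{idf}(t) = \log 1 = 0$.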

![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)

![image-3.png](attachment:image-3.png)

![image-4.png](attachment:image-4.png)

![image-5.png](attachment:image-5.png)

![image-6.png](attachment:image-6.png)

![image-7.png](attachment:image-7.png)

![image-8.png](attachment:image-8.png)

![image-9.png](attachment:image-9.png)

![image-10.png](attachment:image-10.png)



In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ['This is the first document', 'This document is the second document']
# norm=None skips the default L2 row normalization, so the raw tf * idf scores are shown
vectorizer = TfidfVectorizer(norm=None)
x = vectorizer.fit_transform(sentences).toarray()


In [8]:
print(vectorizer.vocabulary_)

{'this': 5, 'is': 2, 'the': 4, 'first': 1, 'document': 0, 'second': 3}


In [10]:
print(vectorizer.get_feature_names_out())

['document' 'first' 'is' 'second' 'the' 'this']


In [11]:
print(x)

[[1.         1.40546511 1.         0.         1.         1.        ]
 [2.         0.         1.         1.40546511 1.         1.        ]]
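
The values in the matrix above can be reproduced by hand, which may help in reading them. The sketch below assumes scikit-learn's default smoothed inverse document frequency, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, and uses a simple whitespace tokenizer that happens to give the same tokens for these two sentences. The helper name `smooth_idf` is introduced here for illustration only.

```python
import math

docs = ["This is the first document", "This document is the second document"]

# Simple tokenization: lowercase and split on whitespace (enough for these sentences)
tokenized = [doc.lower().split() for doc in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})
n_docs = len(docs)


def smooth_idf(term):
    # df = number of documents that contain the term
    df = sum(term in toks for toks in tokenized)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df)) + 1


# tf is the raw count of the term in the document; no row normalization is applied
tfidf = [[toks.count(term) * smooth_idf(term) for term in vocab] for toks in tokenized]

print(vocab)
for row in tfidf:
    print([round(value, 8) for value in row])
```

Terms that occur in both sentences get an idf of `ln(3/3) + 1 = 1`, so their score equals their raw count (hence `2.` for `document` in the second row). `first` and `second` each occur in only one sentence, giving `ln(3/2) + 1 ≈ 1.40546511`.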


![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)

![image-3.png](attachment:image-3.png)

![image-4.png](attachment:image-4.png)
