### Learning how to work with data on webpages

In [1]:
from langchain.document_loaders import UnstructuredURLLoader

In [2]:
loader = UnstructuredURLLoader(urls = [
            "https://www.moneycontrol.com/news/business/markets/hdfc-bank-gains-after-the-big-fall-clsa-says-buy-fiis-warming-up-to-the-stock-12082451.html",
            "https://www.moneycontrol.com/news/business/stocks/buy-hdfc-bank-target-of-rs-1950-kr-choksey-12080431.html"])

data = loader.load()

len(data)

2

In [3]:
data[0]

Document(page_content='English\n\nHindi\n\nGujarati\n\nSpecials\n\nMoneycontrol Trending Stock\n\nInfosys\xa0INE009A01021, INFY, 500209\n\nState Bank of India\xa0INE062A01020, SBIN, 500112\n\nYes Bank\xa0INE528G01027, YESBANK, 532648\n\nBank Nifty\n\nNifty 500\n\nQuotes\n\nMutual Funds\n\nCommodities\n\nFutures & Options\n\nCurrency\n\nNews\n\nCryptocurrency\n\nForum\n\nNotices\n\nVideos\n\nGlossary\n\nAll\n\nHello, LoginHello, LoginLog-inor Sign-UpMy AccountMy Profile My PortfolioMy WatchlistMy Credit Score₹100 CashbackMy FeedMy MessagesMy AlertsMy Profile My PROMy PortfolioMy WatchlistMy Credit Score₹100 CashbackMy FeedMy MessagesMy AlertsLogoutChat with UsDownload AppFollow us on:\n\nPremium\n\nMy Feed\n\nBudget 2        24MarketsHOMEINDIAN INDICESSTOCK ACTIONAll StatsTop GainersTop LosersOnly BuyersOnly Sellers52 Week High52 Week LowPrice ShockersVolume ShockersMost Active StocksGLOBAL MARKETSUS MARKETSBIG SHARK PORTFOLIOSSTOCK SCANNERECONOMIC CALENDARMARKET ACTIONDashboardF&OFII &

### Creating chunks of text using TextSplitter

In [4]:
from langchain.text_splitter import CharacterTextSplitter

In [5]:
wiki_text = """Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan. 
It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine. 
Set in a dystopian future where humanity is embroiled in a catastrophic blight and famine, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for humankind.

Brothers Christopher and Jonathan Nolan wrote the screenplay, which had its origins in a script Jonathan developed in 2007 and was originally set to be directed by Steven Spielberg. 
Kip Thorne, a Caltech theoretical physicist and 2017 Nobel laureate in Physics,[4] was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar. 
Cinematographer Hoyte van Hoytema shot it on 35 mm movie film in the Panavision anamorphic format and IMAX 70 mm. Principal photography began in late 2013 and took place in Alberta, Iceland, and Los Angeles. 
Interstellar uses extensive practical and miniature effects, and the company Double Negative created additional digital effects.

Interstellar premiered in Los Angeles on October 26, 2014. In the United States, it was first released on film stock, expanding to venues using digital projectors. The film received generally positive reviews from critics and grossed over $677 million worldwide ($715 million after subsequent re-releases), making it the tenth-highest-grossing film of 2014. 
It has been praised by astronomers for its scientific accuracy and portrayal of theoretical astrophysics.[5][6][7] Interstellar was nominated for five awards at the 87th Academy Awards, winning Best Visual Effects, and received numerous other accolades."""


In [6]:
splitter = CharacterTextSplitter(separator = "\n", chunk_size = 200, chunk_overlap = 0)

chunks = splitter.split_text(wiki_text)

len(chunks)

Created a chunk of size 210, which is longer than the specified 200
Created a chunk of size 208, which is longer than the specified 200
Created a chunk of size 358, which is longer than the specified 200


9
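
The warnings above appear because `CharacterTextSplitter` only cuts at the given separator, so a single line longer than `chunk_size` has nowhere to be cut and comes out oversized. The sketch below imitates that split-then-merge idea in plain Python. It is a simplified illustration, not LangChain's actual implementation, and `naive_split` is a hypothetical helper name.

```python
def naive_split(text, separator="\n", chunk_size=200):
    """Split on `separator`, then greedily merge pieces up to `chunk_size`.

    Simplified illustration only: a piece that is already longer than
    `chunk_size` is kept whole, because there is no separator to cut it at.
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Toy input: one line is 250 characters long and contains no newline.
sample = "short line\n" + "x" * 250 + "\nanother short line"
demo_chunks = naive_split(sample, chunk_size=200)
oversized = [len(c) for c in demo_chunks if len(c) > 200]
print(len(demo_chunks), oversized)
```

The 250-character line survives as one oversized chunk, which is the same situation the three warnings report for the long lines in `wiki_text`.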

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [8]:
rec_splitter = RecursiveCharacterTextSplitter(separators = ["\n\n", "\n", " "], chunk_size = 200, chunk_overlap = 0)

chunks = rec_splitter.split_text(wiki_text)

len(chunks)

13
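
The recursive splitter produces more chunks because it falls back to a finer separator whenever a piece is still too long. A minimal sketch of that fallback idea follows. It assumes a simplified version of the behaviour (no overlap, separators are dropped at chunk boundaries) and is not the library's code; `recursive_split` is a hypothetical name.

```python
def recursive_split(text, separators=("\n\n", "\n", " "), chunk_size=200):
    """Recursively split `text`, trying coarse separators before fine ones."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard cut every `chunk_size` characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if len(piece) > chunk_size:
            # Flush what has been merged so far, then split this piece more finely.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, rest, chunk_size))
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current.strip():
        chunks.append(current)
    return chunks


# Toy input: 100 words on a single line, so only the " " separator can help.
sample = " ".join(["word"] * 100)
demo_chunks = recursive_split(sample, chunk_size=200)
print(len(demo_chunks), [len(c) for c in demo_chunks])
```

Unlike the single-separator approach, every chunk here stays within `chunk_size`, because the long line can still be cut at spaces.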

### Understanding FAISS (lightweight in-memory vector database)
FAISS performs an efficient similarity search: it compares the vector of the search query against the stored vectors and returns the chunks whose vectors are most similar to it.
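
To make the idea concrete, here is a rough NumPy sketch of what an exact L2 search computes: the squared Euclidean distance from the query to every stored vector, followed by picking the `k` smallest. The toy vectors and the helper name `brute_force_l2_search` are assumptions for illustration; FAISS itself does this far more efficiently.

```python
import numpy as np


def brute_force_l2_search(vectors, query, k=2):
    """Return (squared L2 distances, indices) of the k rows nearest to `query`."""
    vectors = np.asarray(vectors, dtype="float32")
    query = np.asarray(query, dtype="float32").reshape(1, -1)
    # Squared Euclidean distance from the query to every stored vector.
    sq_dist = ((vectors - query) ** 2).sum(axis=1)
    nearest = np.argsort(sq_dist)[:k]
    return sq_dist[nearest], nearest


# Three toy 2-dimensional "embeddings" and one query close to the second row.
toy_vectors = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
toy_dist, toy_idx = brute_force_l2_search(toy_vectors, [0.9, 0.1], k=2)
print(toy_idx, toy_dist)
```

Smaller distances mean closer matches, so results come back ordered from most to least similar.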

In [11]:
import pandas as pd
pd.set_option('display.max_colwidth', 100)

In [12]:
df = pd.read_csv("sample_text.csv")
df

Unnamed: 0,text,category
0,Meditation and yoga can improve mental health,Health
1,"Fruits, whole grains and vegetables helps control blood pressure",Health
2,These are the latest fashion trends for this week,Fashion
3,Vibrant color jeans for male are becoming a trend,Fashion
4,The concert starts at 7 PM tonight,Event
5,Navaratri dandiya program at Expo center in Mumbai this october,Event
6,Exciting vacation destinations for your next trip,Travel
7,Maldives and Srilanka are gaining popularity in terms of low budget vacation places,Travel


In [13]:
from sentence_transformers import SentenceTransformer

#Hugging Face Sentence Transformer
encoder = SentenceTransformer("all-mpnet-base-v2")
vectors = encoder.encode(df.text)
vectors.shape

(8, 768)

In [16]:
dim = vectors.shape[1]
dim

768

In [17]:
import faiss
# Create a flat (exact) index that compares vectors using Euclidean (L2) distance
index = faiss.IndexFlatL2(dim)
index

<faiss.swigfaiss.IndexFlatL2; proxy of <Swig Object of type 'faiss::IndexFlatL2 *' at 0x0000021D96BCE8D0> >

#### Adding the vectors to the index to construct a data structure for efficient similarity search

In [18]:
index.add(vectors)

### Performing a search

In [19]:
search_query = "I want to buy a polo tshirt"

search_vector = encoder.encode(search_query)

search_vector.shape

(768,)

In [21]:
# search expects a 2D array (one row per query), so convert using numpy
import numpy as np

search_vector = np.array(search_vector).reshape(1, -1)
search_vector.shape

(1, 768)
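
The reshape is needed because the search call works on a batch of queries, one per row. A small sketch of the shape change, using a zero vector as a stand-in for the encoded query (the variable names are illustrative only):

```python
import numpy as np

single = np.zeros(768, dtype="float32")  # stand-in for one encoded query, shape (768,)
batch = single.reshape(1, -1)            # one row (one query), 768 columns
also_batch = single[np.newaxis, :]       # equivalent way to add the leading axis
print(single.shape, batch.shape, also_batch.shape)
```

Either form yields a batch containing exactly one query vector.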

In [22]:
index.search(search_vector, k = 2)

(array([[1.3101504, 1.3429667]], dtype=float32), array([[2, 3]], dtype=int64))

The returned indices (2 and 3) correspond to the rows labelled Fashion in the dataframe.

In [23]:
dist, idx = index.search(search_vector, k = 2)

df.loc[idx[0]]

Unnamed: 0,text,category
2,These are the latest fashion trends for this week,Fashion
3,Vibrant color jeans for male are becoming a trend,Fashion
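
As a possible follow-up, the distances and indices can be paired so each match is shown next to its score. The sketch below hard-codes the texts and the search output printed earlier in this notebook rather than calling FAISS, and `top_matches` is a hypothetical helper.

```python
import numpy as np

# Texts copied from the dataframe shown above, in row order.
texts = [
    "Meditation and yoga can improve mental health",
    "Fruits, whole grains and vegetables helps control blood pressure",
    "These are the latest fashion trends for this week",
    "Vibrant color jeans for male are becoming a trend",
    "The concert starts at 7 PM tonight",
    "Navaratri dandiya program at Expo center in Mumbai this october",
    "Exciting vacation destinations for your next trip",
    "Maldives and Srilanka are gaining popularity in terms of low budget vacation places",
]


def top_matches(dist, idx, texts):
    """Pair each returned index with its text and distance (first query only)."""
    return [(texts[i], float(d)) for d, i in zip(dist[0], idx[0])]


# Same values as the search output above: one query, k = 2.
dist = np.array([[1.3101504, 1.3429667]], dtype="float32")
idx = np.array([[2, 3]], dtype="int64")

matches = top_matches(dist, idx, texts)
for text, d in matches:
    print(f"{d:.4f}  {text}")
```

Row `0` of `dist` and `idx` belongs to the first (and here only) query, which is why the code indexes `[0]` on both.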
