### Installation

To get started with Jupyter Notebook, follow these steps:

1. Open your browser and search for "Anaconda Navigator download for Windows". Click on the first link, which will take you to the Anaconda website. Download and install the software.
2. Once installed, open Anaconda Navigator. You'll see various applications such as PyCharm and JupyterLab. Click "Install" (if needed) and then "Launch" for Jupyter Notebook.
3. Launching Jupyter Notebook opens a localhost page in your browser. Localhost means that Jupyter runs a local server on your own machine, which lets you view and manage your files from the browser.
4. To create a new notebook, click "New" and select the desired kernel (in this case, Python). Your new notebook will have an `.ipynb` extension.

# Table of Contents


### Module 1 - GenerativeAI Fundamentals

1. Introduction to AI and its Evolution
2. Machine Learning vs Deep Learning vs GenerativeAI
3. What is GenerativeAI and Its Real-World Applications
4. NLP Fundamentals and How to Build a Text Pre-processing Pipeline
5. Text Normalization and Tokenization
6. Embedding and Word2Vec

## 1. Introduction to AI and Its Evolution

<img src='g1.png' />

https://newsletter.himanshuramchandani.co/p/nlp-in-a-nutshell

In [1]:
print("Hello AI")

Hello AI


In [None]:
AI - replicating human intelligence



1943 - "A Logical Calculus of the Ideas Immanent in Nervous Activity" (McCulloch and Pitts) - foundation of neural networks

1950 - Turing test

1956 - the term "Artificial Intelligence" is coined - John McCarthy

1965 - NLP

1997 - IBM Deep Blue defeats the world chess champion

2004 - self-driving cars

2006 - deep learning - Geoffrey Hinton

2007 - Apple - voice recognition (Siri added in 2011)

2012 - AlexNet - deep CNN - computer vision

2014 - Google acquires DeepMind; its AlphaGo, trained with reinforcement learning, goes on to defeat a professional human Go player (2015-2016)

2017 - AlphaGo Zero - trained purely by self-play, without any human game data, it defeats its predecessor

2020 - GPT-3

2021 - Google - MUM (Multitask Unified Model)

2022 - DALL-E 2 and Stable Diffusion bring GenerativeAI to a wide audience

2023 - DeepMind - AlphaFold - predicts the structure of nearly all known proteins

2024 - LLMs

## 2. Machine Learning vs Deep Learning vs GenerativeAI

<img src='g2.png' />

In [None]:
Machine Learning
- requires feature engineering
- structured data - tabular data, relational databases
- domain knowledge is needed for designing features
- can be trained on small datasets
- runs on CPUs
- decisions can be understood (not a black box)

In [None]:
Deep Learning
- no manual feature engineering needed
- unstructured data - images, audio, video, text
- minimal domain knowledge needed, since features are learned rather than hand-designed
- needs a large amount of data to perform well
- GPUs and TPUs
- considered a black box - it is harder to explain how it comes up with a prediction

In [None]:
GenerativeAI

Statistical modeling
- generative modeling - generating numbers (probabilities)
- GPUs and TPUs
- needs a very large amount of data (GPT-3 - roughly 500 billion tokens)
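
To make "generating numbers (probabilities)" concrete, here is a minimal sketch of how a generative model could pick the next word: it turns raw scores into a probability distribution and then samples from it. The candidate words and their scores are made-up values for illustration, not the output of any real model.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores a model might assign to candidate next words
# after the prompt "the cat sat on the" (made-up numbers).
candidates = ["mat", "sofa", "moon"]
scores = [2.0, 1.0, -1.0]

probabilities = softmax(scores)

# Generation = sampling from the predicted distribution.
rng = random.Random(0)                          # fixed seed so the run is repeatable
next_word = rng.choices(candidates, weights=probabilities, k=1)[0]

for word, p in zip(candidates, probabilities):
    print(f"{word}: {p:.3f}")
print("sampled:", next_word)
```

Because the next word is sampled rather than always taken as the most likely one, the same prompt can produce different continuations.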


## 3. What is GenerativeAI and Its Real-World Applications


### Text Generation

In [None]:
text generation
- ChatGPT
- Gemini
- Perplexity

text creation
- copy.ai

translation
- Google Translate
- DeepL

### Coding

<img src='gif1.gif' />

In [None]:
- Amazon CodeWhisperer
- GitHub Copilot

code documentation
- OpenAI Codex



### Images and Videos

<img src='gif2.gif' />


In [None]:
image AI art
- DALL-E 2
- Midjourney (join their Discord for testing)

- Canva

video generation
- deepfakes
- AI video editing - RunwayML


### Audio

<img src='gif3.gif' />

In [None]:
text-to-speech
- murf.ai
- Replica Studios

music composition
- OpenAI Jukebox



In [None]:
What does an AI engineer do exactly?

How much Python should a lead know?

How much statistics should a lead know?



## 4. NLP Fundamentals and How to Build a Text Pre-processing Pipeline


<img src='n4.png' />

In [None]:
NLP

- bridges the gap between human and machine

2 types
- NLU (Natural Language Understanding) - semantic analysis (context and intent)
- NLG (Natural Language Generation) - generating the next word

In [None]:
context 
"your t-shirt is killer"

intent
"my mom gave me money to buy 1kg tomatoes otherwise she will be angry"

### Text Processing Pipeline

In [None]:
ord("A")   # 65, the code point of "A" in both ASCII and UTF-8

In [None]:
you cannot feed raw internet data to the model directly; it has to be cleaned first

In [None]:
"India" != "INDIA"  # True: the data is not normalized

In [None]:
# raw text

"<SUBJECT LINE> Employees details. \
<END><BODY TEXT>Attached are 2 files 1st, one is pairoll 2nd is healtcare !"

In [None]:
# remove encodings (markup tags such as <SUBJECT LINE> and <BODY TEXT>)

"Employees details. Attached are 2 files 1st, one is pairoll 2nd is healtcare !"

In [None]:
# lower casing

"employees details. attached are 2 files 1st, one is pairoll 2nd is healtcare !"

In [None]:
# digits to words

"employees details. attached are two files first, one is pairoll second is healtcare !"

In [None]:
# remove special characters - @!#$%^

"employees details attached are two files first one is pairoll second is healtcare"

In [None]:
# spelling corrections

"employees details attached are two files first one is payroll second is healthcare"

In [None]:
# remove stop words

"employees details attached two files first one payroll second healthcare"

In [None]:
# stemming

"employe detail attached two file first one payroll second healthcare"

In [None]:
# lemmatization - ran->run, jumped->jump

"employe detail attach two file first one payroll second healthcare"

In [None]:
Now the text is ready to feed into a model
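
The steps above, up to stop word removal, can be strung together as one function. This is a minimal sketch using only the standard library. The lookup tables (`NUMBER_WORDS`, `SPELLING_FIXES`, `STOP_WORDS`) are tiny hand-written assumptions that cover just this one example. A real pipeline would likely rely on proper libraries for spelling correction and stop words, and would continue with stemming or lemmatization.

```python
import re

# Hypothetical lookup tables, kept tiny for illustration.
NUMBER_WORDS = {"2": "two", "1st": "first", "2nd": "second"}
SPELLING_FIXES = {"pairoll": "payroll", "healtcare": "healthcare"}
STOP_WORDS = {"are", "is"}

def preprocess(raw):
    text = re.sub(r"<[^>]+>", " ", raw)                      # remove markup tags
    text = text.lower()                                      # lower casing
    text = re.sub(r"\w+",                                    # digits to words
                  lambda m: NUMBER_WORDS.get(m.group(0), m.group(0)),
                  text)
    text = re.sub(r"[^a-z\s]", "", text)                     # remove special characters
    tokens = [SPELLING_FIXES.get(t, t) for t in text.split()]  # spelling corrections
    tokens = [t for t in tokens if t not in STOP_WORDS]      # remove stop words
    return " ".join(tokens)

raw = ("<SUBJECT LINE> Employees details. "
       "<END><BODY TEXT>Attached are 2 files 1st, one is pairoll 2nd is healtcare !")

print(preprocess(raw))
# employees details attached two files first one payroll second healthcare
```

The order matters: digits are converted to words before special characters are removed, otherwise "1st" and "2nd" would lose their digits and could no longer be looked up.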

## 5. Text Normalization and Tokenization

Tokenization Example - https://platform.openai.com/tokenizer

<img src='v42.png' />

<img src='v43.jpg' />

In [None]:
this is an apple

4 tokens

### Tokenization

1. Word Tokenization
2. Sentence Tokenization
3. Regular Expression Tokenization

In [None]:
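# One-time setup: the tokenizer models and the stop word list must be
# downloaded with nltk.download() before these functions can be used.
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords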

In [None]:
text = 'this is a single sentence.'

tokens = word_tokenize(text)

print(tokens) # ['this', 'is', 'a', 'single', 'sentence', '.']

In [8]:
no_punctuation = [word.lower() for word in tokens if word.isalpha()]
no_punctuation

['this', 'is', 'a', 'single', 'sentence']

In [9]:
text = 'this is the first sentence. this is the second sentence. this is the document.'

print(sent_tokenize(text))

['this is the first sentence.', 'this is the second sentence.', 'this is the document.']


In [10]:
print([word_tokenize(sentence) for sentence in sent_tokenize(text)])

[['this', 'is', 'the', 'first', 'sentence', '.'], ['this', 'is', 'the', 'second', 'sentence', '.'], ['this', 'is', 'the', 'document', '.']]


In [11]:
stop_words = stopwords.words('english')

print(stop_words[:20])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his']


In [12]:
text = 'this is the first sentence. this is the second sentence. this is the document.'

tokens = [token for token in word_tokenize(text) if token not in stop_words]

print(tokens)

['first', 'sentence', '.', 'second', 'sentence', '.', 'document', '.']
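
The cells above cover word and sentence tokenization. The third type from the list, regular expression tokenization, can be sketched with Python's built-in `re` module. The two patterns below are illustrative choices, not the only reasonable ones.

```python
import re

text = "this is the first sentence. this is the second sentence."

# \w+ matches runs of letters, digits and underscores, so punctuation is dropped.
word_tokens = re.findall(r"\w+", text)

# \w+|[^\w\s] additionally keeps each punctuation mark as its own token.
tokens_with_punct = re.findall(r"\w+|[^\w\s]", text)

print(word_tokens)
print(tokens_with_punct)
```

NLTK offers the same idea through `RegexpTokenizer` in `nltk.tokenize`, which takes a pattern and exposes a `tokenize` method.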


In [None]:
GPT-3

roughly 500 billion tokens of training data

175 billion parameters

Tokens vs parameters

https://newsletter.himanshuramchandani.co/p/tokens-vs-parameters-in-llms
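
As a rough illustration of the difference: tokens measure the size of the data, while parameters measure the size of the model. The sketch below counts both for a toy corpus and a hypothetical tiny model (one embedding table plus one output layer). The corpus, the embedding size, and the model layout are all assumptions made for this example.

```python
# Tokens: the units of text the model is trained on.
corpus = "this is an apple . this is a document ."
tokens = corpus.split()              # whitespace tokenization, for illustration only
vocabulary = sorted(set(tokens))

# Parameters: the learned numbers inside the model.
embedding_dim = 4
embedding_params = len(vocabulary) * embedding_dim                  # one vector per vocabulary entry
output_params = embedding_dim * len(vocabulary) + len(vocabulary)   # weights plus biases

print("tokens in corpus:", len(tokens))
print("vocabulary size:", len(vocabulary))
print("model parameters:", embedding_params + output_params)
```

Adding more text grows the token count, while the parameter count only changes if the vocabulary or the model architecture changes.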

## 6. Embedding and Word2Vec

Dataset - https://nlp.stanford.edu/projects/glove/

In [21]:
import numpy as np

In [22]:
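def loadGlove(path):
    """Load GloVe vectors from a text file into a {word: vector} dictionary."""
    model = {}

    # Each line holds a word followed by its vector components, separated by spaces.
    # The with-block closes the file once reading is done.
    with open(path, 'r', encoding='utf8') as file:
        for l in file:
            line = l.split()
            word = line[0]
            value = np.array([float(val) for val in line[1:]])
            model[word] = value

    return model

glove = loadGlove('glove.6B.50d.txt')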

In [23]:
glove['python']   # vector embedding for the word Python

array([ 0.5897  , -0.55043 , -1.0106  ,  0.41226 ,  0.57348 ,  0.23464 ,
       -0.35773 , -1.78    ,  0.10745 ,  0.74913 ,  0.45013 ,  1.0351  ,
        0.48348 ,  0.47954 ,  0.51908 , -0.15053 ,  0.32474 ,  1.0789  ,
       -0.90894 ,  0.42943 , -0.56388 ,  0.69961 ,  0.13501 ,  0.16557 ,
       -0.063592,  0.35435 ,  0.42819 ,  0.1536  , -0.47018 , -1.0935  ,
        1.361   , -0.80821 , -0.674   ,  1.2606  ,  0.29554 ,  1.0835  ,
        0.2444  , -1.1877  , -0.60203 , -0.068315,  0.66256 ,  0.45336 ,
       -1.0178  ,  0.68267 , -0.20788 , -0.73393 ,  1.2597  ,  0.15425 ,
       -0.93256 , -0.15025 ])

In [24]:
glove['neural']

array([ 0.92803 ,  0.29096 ,  0.67837 ,  1.0444  , -0.72551 ,  2.1995  ,
        0.88767 , -0.94782 ,  0.67426 ,  0.24908 ,  0.95722 ,  0.18122 ,
        0.064263,  0.64323 , -1.6301  ,  0.94972 , -0.7367  ,  0.17345 ,
        0.67638 ,  0.10026 , -0.033782, -0.76971 ,  0.40519 , -0.099516,
        0.79654 ,  0.1103  , -0.076053, -0.090434,  0.015021, -1.137   ,
        1.6803  , -0.34424 ,  0.77538 , -1.8718  , -0.17148 ,  0.31956 ,
        0.093062,  0.004996,  0.25716 ,  0.52207 , -0.52548 , -0.93144 ,
       -1.0553  ,  1.4401  ,  0.30807 , -0.84872 ,  1.9986  ,  0.10788 ,
       -0.23633 , -0.17978 ])

### How does the system know that these words are similar?


Cosine Similarity

In [25]:
from sklearn.metrics.pairwise import cosine_similarity

In [26]:
cosine_similarity(glove['cat'].reshape(1,-1), glove['dog'].reshape(1,-1))

array([[0.92180053]])

In [27]:
cosine_similarity(glove['cat'].reshape(1,-1), glove['piano'].reshape(1,-1))

array([[0.19825255]])

In [28]:
cosine_similarity(glove['king'].reshape(1,-1), glove['queen'].reshape(1,-1))

array([[0.7839043]])
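
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures the angle between them rather than their magnitude. A minimal from-scratch sketch follows. The three short vectors are made-up values, not real GloVe embeddings.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors for illustration.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]     # same direction as a, twice as long
c = [-3.0, 0.0, 1.0]    # at a right angle to a

print(cosine(a, b))     # close to 1: same direction
print(cosine(a, c))     # 0: unrelated directions
```

Values near 1 mean the vectors point the same way, values near 0 mean they are unrelated, which matches the higher score for cat/dog than for cat/piano above.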

## Words in 2D Embedding Space

<img src='v44.png' />

In [None]:
himanshu is taking the session. he will guide others.

In [None]:
place-delhi

In [None]:
NER (Named Entity Recognition)
- names
- places
- organizations
- quantities
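
As a toy illustration of the idea, entities can be tagged by looking words up in a hand-written list (a gazetteer). This is only a sketch: the `GAZETTEER` dictionary and the example sentence are assumptions made for illustration, and real NER systems typically use trained statistical or neural models that also take context into account.

```python
# Hypothetical gazetteer: a hand-written lookup from known words to entity types.
GAZETTEER = {"himanshu": "NAME", "delhi": "PLACE"}

def tag_entities(sentence):
    """Return (token, label) pairs for every token found in the gazetteer."""
    entities = []
    for token in sentence.lower().replace(".", " ").split():
        label = GAZETTEER.get(token)
        if label:
            entities.append((token, label))
    return entities

print(tag_entities("himanshu is taking the session in delhi."))
```

A lookup like this cannot handle names it has never seen or words whose meaning depends on context, which is why learned models are used in practice.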

# Roadmap

Neurons to GenerativeAI - https://god-level-python.notion.site/Neurons-to-GenerativeAI-Live-Bootcamp-a59ec2f641084c488179271fc077f0c4?pvs=4

## Resources

Research Paper
- Attention is All You Need: https://arxiv.org/abs/1706.03762