# Weekly Learning Program (25/May/24 - 31/May/24)

I am starting this weekly Jupyter notebook series to help us improve our technical skills and become interview-ready for any internal job postings. Our ultimate goal is to prepare an end-to-end NLP model with minimal help from ChatGPT. The technical skills covered will include SQL, Python (pandas, numpy, nltk, and other data-related libraries), Excel, and Machine Learning.

* I will share a notebook every week that will have problem statements related to SQL, Python, and ML.
* Please download that notebook and complete the solution.
* Once done, save that notebook to your GitHub project.

## How to Save Your Projects on GitHub

#### Create an Account on GitHub

* Sign up for a GitHub account if you do not have one, then log in.

#### Create a New Repository

* Click on the "+" icon in the upper right corner and select "New repository".
* Enter a repository name, for example, weekly-notebooks.
* Optionally, add a description.
* Choose the visibility (public or private).
* Optionally, initialize the repository with a README file.
* Click on "Create repository".

#### Uploading Your First Jupyter Notebook

* Open your repository by navigating to your newly created repository.
* Click on the "Add file" button and select "Upload files".
* Drag and drop your Jupyter notebook file (with a .ipynb extension) into the upload area, or click on "choose your files" to select the file from your computer.
* Add a commit message, for example, "Add first weekly notebook".
* Click on "Commit changes".

### Some Key Points to Remember

* Always include extensive comments in your solutions.
* It would be beneficial to explain the logic you used to solve each problem.
* Once finished, compare your solution with the best available solutions on the internet to enhance your problem-solving skills.
* If you get stuck, ask ChatGPT for hints rather than the complete solution.

## What Will the WLP (Weekly Learning Program) Notebook Include?

* 3-4 SQL questions
* 2-3 Python questions related to data manipulation (using numpy and pandas)
* Snippets and explanations related to NLP (to solve, read, or understand)
* Mathematical concepts used in ML

### Where to Solve the SQL Questions?

* I will provide the dataset along with the questions.
* You can use the following website to practice the SQL questions: [Programiz SQL Compiler](https://www.programiz.com/sql/online-compiler/)
* Please paste your solutions into the Jupyter notebook.

# SQL QUESTIONS

### 1. Given an Employees table with columns Id and Salary, write a query to find the second highest salary. If there is no second highest salary, the query should return NULL.

* Dataset 1: Contains Second Highest Salary

```sql
CREATE TABLE Employees (
    Id INT PRIMARY KEY,
    Salary INT
);

INSERT INTO Employees (Id, Salary) VALUES
(1, 150),
(2, 250),
(3, 350),
(4, 450);
```

* Dataset 2: No Second Highest Salary

```sql
CREATE TABLE Employees (
    Id INT PRIMARY KEY,
    Salary INT
);

INSERT INTO Employees (Id, Salary) VALUES
(1, 500),
(2, 500),
(3, 500);
```

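#### Solution below
* Use `DENSE_RANK()` in descending order so the highest salary gets rank 1 and tied salaries share a rank.
* Use `MAX(CASE ...)` so the query always returns exactly one row, which is NULL when no rank 2 exists.
```sql
with ranked as
 (
   -- rank 1 = highest salary; ties share the same rank
   select salary, dense_rank() over (order by salary desc) as rnk
   from Employees
 )
 -- the aggregate collapses the result to one row, NULL if rank 2 is missing
 select max(case when rnk = 2 then salary end) as second_highest_salary
 from ranked;
```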
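
If you want to check a query against both datasets without leaving the notebook, one option is Python's built-in `sqlite3` module. The sketch below is only an illustration: the helper name `second_highest_salary` is made up for this example, and it assumes your SQLite build supports window functions (version 3.25 or later).

```python
import sqlite3

# Query under test: DENSE_RANK in descending order, collapsed to a single row.
QUERY = """
WITH ranked AS (
    SELECT Salary, DENSE_RANK() OVER (ORDER BY Salary DESC) AS rnk
    FROM Employees
)
SELECT MAX(CASE WHEN rnk = 2 THEN Salary END) AS second_highest_salary
FROM ranked;
"""

def second_highest_salary(salaries):
    """Hypothetical helper: load salaries into an in-memory table and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Employees (Id INTEGER PRIMARY KEY, Salary INT)")
    conn.executemany(
        "INSERT INTO Employees (Id, Salary) VALUES (?, ?)",
        list(enumerate(salaries, start=1)),
    )
    result = conn.execute(QUERY).fetchone()[0]
    conn.close()
    return result

print(second_highest_salary([150, 250, 350, 450]))  # Dataset 1 -> 350
print(second_highest_salary([500, 500, 500]))       # Dataset 2 -> None (SQL NULL)
```

SQL `NULL` comes back as Python `None`, so Dataset 2 is an easy way to confirm that the "no second highest salary" case is handled.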

### 2. Write a SQL query to get the nth highest salary from the Employees table. For example, given Dataset 1 below, the nth highest salary where n = 2 is 350. If there is no nth highest salary, the query should return NULL. Please use a parameter to perform this task.

* Dataset 1:
  
```sql
CREATE TABLE Employees (
    Id INT PRIMARY KEY,
    Salary INT
);

INSERT INTO Employees (Id, Salary) VALUES
(1, 150),
(2, 250),
(3, 350),
(4, 450);
```

* Dataset 2:

```sql
CREATE TABLE Employees (
    Id INT PRIMARY KEY,
    Salary INT
);

INSERT INTO Employees (Id, Salary) VALUES
(1, 400),
(2, 400),
(3, 400);
```



#### Solution below
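```sql
-- T-SQL syntax; change @n to pick a different rank
DECLARE @n INT = 2;

with ranked as
 (
   -- rank 1 = highest salary; ties share the same rank
   select salary, dense_rank() over (order by salary desc) as rnk
   from Employees
 )
 -- returns one row, NULL when there is no nth highest salary
 select max(case when rnk = @n then salary end) as nth_highest_salary
 from ranked;
```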
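
`DECLARE` is T-SQL syntax and is not available in SQLite. If you are practicing on a SQLite-based tool, one way to pass the parameter is through a bound placeholder from Python. This is a minimal sketch under that assumption; `nth_highest_salary` and the `:n` placeholder name are chosen for illustration only.

```python
import sqlite3

# :n is a named placeholder that sqlite3 fills in from a dictionary.
NTH_QUERY = """
WITH ranked AS (
    SELECT Salary, DENSE_RANK() OVER (ORDER BY Salary DESC) AS rnk
    FROM Employees
)
SELECT MAX(CASE WHEN rnk = :n THEN Salary END) AS nth_highest_salary
FROM ranked;
"""

def nth_highest_salary(salaries, n):
    """Hypothetical helper: return the nth highest distinct salary, or None."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Employees (Id INTEGER PRIMARY KEY, Salary INT)")
    conn.executemany(
        "INSERT INTO Employees (Id, Salary) VALUES (?, ?)",
        list(enumerate(salaries, start=1)),
    )
    value = conn.execute(NTH_QUERY, {"n": n}).fetchone()[0]
    conn.close()
    return value

print(nth_highest_salary([150, 250, 350, 450], 2))  # Dataset 1 -> 350
print(nth_highest_salary([400, 400, 400], 2))       # Dataset 2 -> None
```

Binding the value instead of formatting it into the SQL string also keeps the query safe from SQL injection.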

### 3. Write a SQL query to find all numbers that appear at least three times consecutively in the Logs table. For example, in the Logs table below, the numbers 2 and 3 each appear three times in a row.

```sql
CREATE TABLE Logs (
    Id INT PRIMARY KEY,
    Num INT
);

INSERT INTO Logs (Id, Num) VALUES
(1, 2),
(2, 2),
(3, 2),
(4, 3),
(5, 3),
(6, 3),
(7, 1),
(8, 2);
```

#### Solution below

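```sql
-- Assumes Id values are consecutive with no gaps.
-- Within one run of equal numbers, Id minus the row number stays constant,
-- so (num, grp) identifies each consecutive run.
with grouped as
	(
      select num,
             id - row_number() over (partition by num order by id) as grp
      from Logs
    )
select distinct num
from grouped
group by num, grp
having count(*) >= 3
order by num;
```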
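
To sanity-check the expected answer for this question, the same "runs of equal values" idea can be written in plain Python with `itertools.groupby`. This is only a cross-check sketch; the function name `consecutive_nums` and the `min_run` parameter are hypothetical.

```python
from itertools import groupby

# Same rows as the Logs table: (Id, Num)
logs = [(1, 2), (2, 2), (3, 2), (4, 3), (5, 3), (6, 3), (7, 1), (8, 2)]

def consecutive_nums(rows, min_run=3):
    """Return numbers that occur in a run of at least min_run rows, ordered by Id."""
    ordered = [num for _, num in sorted(rows)]  # sort by Id, keep Num
    found = []
    for num, run in groupby(ordered):           # groupby yields one group per run
        if len(list(run)) >= min_run and num not in found:
            found.append(num)
    return found

print(consecutive_nums(logs))  # [2, 3]
```

Note that the final `2` at Id 8 starts a new run of length one, so it does not change the result.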

# PANDAS

### 1. Calculate the Mean Salary by Department and Filter Departments with Average Salary Above a Threshold

```python
import pandas as pd
import numpy as np

# Sample data
data = {
    'EmployeeId': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Department': ['HR', 'IT', 'HR', 'IT', 'HR', 'Sales', 'IT', 'Sales', 'HR', 'Sales'],
    'Salary': [4000, 5000, 4200, 5500, 4500, 6000, 4800, 6200, 4700, 6300]
}
df = pd.DataFrame(data)
```


```python
# Solution
# Mean salary per department
grouped_dept_mean = df.groupby('Department')['Salary'].mean()
print(grouped_dept_mean)

# Keep only the departments whose mean salary exceeds the threshold
threshold = 5000
filtered_grouped_dept_mean = grouped_dept_mean[grouped_dept_mean > threshold]
filtered_grouped_dept_mean
```

```
Department
HR       4350.000000
IT       5100.000000
Sales    6166.666667
Name: Salary, dtype: float64

Department
IT       5100.000000
Sales    6166.666667
Name: Salary, dtype: float64
```
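
A common follow-up is to keep the individual employee rows from the qualifying departments rather than only the averages. One possible way, sketched here with `groupby(...).filter(...)`, is shown below; the variable names `above` and `rows_in_high_paying_depts` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'EmployeeId': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Department': ['HR', 'IT', 'HR', 'IT', 'HR', 'Sales', 'IT', 'Sales', 'HR', 'Sales'],
    'Salary': [4000, 5000, 4200, 5500, 4500, 6000, 4800, 6200, 4700, 6300],
})
threshold = 5000

# Department-level averages above the threshold (same result as the solution above)
dept_mean = df.groupby('Department')['Salary'].mean()
above = dept_mean[dept_mean > threshold]

# Employee-level rows: filter() keeps every row of a group when the function returns True
rows_in_high_paying_depts = df.groupby('Department').filter(
    lambda g: g['Salary'].mean() > threshold
)

print(above.index.tolist())
print(rows_in_high_paying_depts)
```

The boolean mask on the grouped Series answers "which departments?", while `filter` answers "which employees belong to those departments?".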

### 2. Write a pandas code snippet to filter the rows where the Age is greater than 30 and the Salary is less than 5000.

```python
import pandas as pd

# Sample data
data = {
    'EmployeeId': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Age': [25, 30, 45, 28, 32, 41, 34, 29, 30, 31],
    'Department': ['HR', 'IT', 'HR', 'IT', 'HR', 'Sales', 'IT', 'Sales', 'HR', 'Sales'],
    'Salary': [4000, 5000, 4200, 5500, 4500, 6000, 4800, 6200, 4700, 6300]
}
df = pd.DataFrame(data)
```


```python
# Solution: combine both conditions with &, wrapping each one in parentheses
df[(df['Age'] > 30) & (df['Salary'] < 5000)]
```

```
   EmployeeId  Age Department  Salary
2           3   45         HR    4200
4           5   32         HR    4500
6           7   34         IT    4800
```
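
The same filter can also be written with `DataFrame.query`, which some people find easier to read when there are several conditions. A small sketch, with `filtered` as an illustrative variable name:

```python
import pandas as pd

df = pd.DataFrame({
    'EmployeeId': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Age': [25, 30, 45, 28, 32, 41, 34, 29, 30, 31],
    'Department': ['HR', 'IT', 'HR', 'IT', 'HR', 'Sales', 'IT', 'Sales', 'HR', 'Sales'],
    'Salary': [4000, 5000, 4200, 5500, 4500, 6000, 4800, 6200, 4700, 6300],
})

# query() evaluates the expression against the column names
filtered = df.query('Age > 30 and Salary < 5000')
print(filtered['EmployeeId'].tolist())  # [3, 5, 7]
```

With boolean masks the parentheses around each condition are required because `&` binds more tightly than `>` and `<`; inside `query` the word `and` avoids that pitfall.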


### 3. Write a pandas code snippet to add a new column SalaryLevel to the DataFrame df. The SalaryLevel column should contain 'High' if the Salary is greater than 5000, 'Medium' if the Salary is between 4500 and 5000, and 'Low' if the Salary is less than 4500.

```python
import pandas as pd

# Sample data
data = {
    'EmployeeId': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Department': ['HR', 'IT', 'HR', 'IT', 'HR', 'Sales', 'IT', 'Sales', 'HR', 'Sales'],
    'Salary': [4000, 5000, 4200, 5500, 4500, 6000, 4800, 6200, 4700, 6300]
}
df = pd.DataFrame(data)
```


```python
# Solution: salaries from 4500 to 5000 inclusive are 'Medium', so the second test uses >=
df['SalaryLevel'] = df['Salary'].apply(
    lambda x: 'High' if x > 5000 else ('Medium' if x >= 4500 else 'Low')
)
print(df)
```

```
   EmployeeId Department  Salary SalaryLevel
0           1         HR    4000         Low
1           2         IT    5000      Medium
2           3         HR    4200         Low
3           4         IT    5500        High
4           5         HR    4500      Medium
5           6      Sales    6000        High
6           7         IT    4800      Medium
7           8      Sales    6200        High
8           9         HR    4700      Medium
9          10      Sales    6300        High
```
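
For larger DataFrames, a vectorized alternative to `apply` is `numpy.select`, which checks a list of conditions in order. The sketch below assumes "between 4500 and 5000" includes both end points, so a salary of exactly 4500 is labeled 'Medium'.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'EmployeeId': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Department': ['HR', 'IT', 'HR', 'IT', 'HR', 'Sales', 'IT', 'Sales', 'HR', 'Sales'],
    'Salary': [4000, 5000, 4200, 5500, 4500, 6000, 4800, 6200, 4700, 6300],
})

# Conditions are evaluated in order; the first one that is True wins
conditions = [df['Salary'] > 5000, df['Salary'] >= 4500]
choices = ['High', 'Medium']

# Anything matching no condition (below 4500) falls back to the default
df['SalaryLevel'] = np.select(conditions, choices, default='Low')
print(df)
```

Because the whole column is processed at once, this avoids calling a Python function for every row.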


# NLP & NLP MODELS

### What is NLP?

**Natural Language Processing (NLP)** is a field of artificial intelligence that focuses on the interaction between computers and humans through natural language. It involves the development of algorithms and models that enable computers to understand, interpret, and generate human language.

### Different Models in NLP

#### Basic Models

1. **Bag of Words (BoW)**
   - Represents text data as a collection of words, disregarding grammar and word order.
   - Uses word frequencies or occurrences.

2. **Term Frequency-Inverse Document Frequency (TF-IDF)**
   - Enhances the BoW model by considering the importance of words.
   - Weights words based on their frequency in a document and their rarity across documents.

3. **n-Grams**
   - Considers sequences of n words to capture context.
   - Common examples include bigrams (2 words) and trigrams (3 words).

4. **Latent Semantic Analysis (LSA)**
   - Reduces dimensionality using Singular Value Decomposition (SVD).
   - Captures hidden semantic structures in text data.
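
To make the first three ideas concrete, here is a small pure-Python sketch of Bag of Words, n-grams, and TF-IDF. The toy corpus and the function names are invented for illustration, and the TF-IDF formula used is the plain unsmoothed version (term frequency times `log(N / document frequency)`); libraries often use smoothed variants, so their numbers may differ.

```python
import math
from collections import Counter

# Toy corpus, made up for this example
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

def tokenize(text):
    """Lowercase and split on whitespace (no punctuation handling)."""
    return text.lower().split()

def bag_of_words(text):
    """Word counts, ignoring order."""
    return Counter(tokenize(text))

def ngrams(text, n=2):
    """All sequences of n adjacent tokens."""
    tokens = tokenize(text)
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(corpus):
    """Per-document dict of term -> tf * log(N / df)."""
    tokenized = [tokenize(d) for d in corpus]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

print(bag_of_words(docs[0]))
print(ngrams(docs[0], 2))
print(tf_idf(docs)[0])
```

In the first document, "the" occurs twice but also appears in another document, so its TF-IDF weight ends up lower than that of "cat", which is unique to that document.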

#### Intermediate Models

5. **Word Embeddings**
   - **Word2Vec**: Uses neural networks to learn word associations.
   - **GloVe**: Combines global word co-occurrence statistics with local context-based learning.
   - **FastText**: Extends Word2Vec by considering subword information.

6. **Topic Modeling**
   - **Latent Dirichlet Allocation (LDA)**: Probabilistic model that discovers topics within a collection of documents.
   - **Non-Negative Matrix Factorization (NMF)**: Matrix decomposition technique for identifying latent topics.
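
The intuition behind word embeddings is that words used in similar contexts end up with similar vectors. The sketch below is not Word2Vec, GloVe, or FastText; it only illustrates that intuition with raw co-occurrence counts and cosine similarity. The sentences, the `window` parameter, and the function names are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

# Toy sentences, made up for this example
sentences = [
    "i drink hot coffee",
    "i drink hot tea",
    "i drive a car",
]

def cooccurrence_vectors(corpus, window=2):
    """Map each word to counts of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

vectors = cooccurrence_vectors(sentences)
print(cosine(vectors["coffee"], vectors["tea"]))  # close to 1: same neighbors
print(cosine(vectors["coffee"], vectors["car"]))  # 0.0: no shared neighbors
```

Trained embedding models learn dense, low-dimensional vectors instead of raw counts, but the similarity comparison works the same way.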

#### Advanced Models

7. **Recurrent Neural Networks (RNNs)**
   - Captures sequential dependencies in text.
   - **Long Short-Term Memory (LSTM)**: Addresses the vanishing gradient problem in RNNs.
   - **Gated Recurrent Units (GRUs)**: Simplified version of LSTM with fewer parameters.

8. **Convolutional Neural Networks (CNNs)**
   - Originally designed for image processing but effective for text classification tasks.

9. **Attention Mechanisms**
   - Enhances the ability of models to focus on relevant parts of the input sequence.
   - **Bahdanau Attention**: An early form of attention in neural machine translation.

#### Latest Models

10. **Transformers**
    - Revolutionized NLP by replacing recurrence with self-attention, which lets every token attend to every other token in the sequence in parallel.
    - Examples include:
      - **BERT (Bidirectional Encoder Representations from Transformers)**
      - **GPT (Generative Pre-trained Transformer)**
      - **RoBERTa (Robustly optimized BERT approach)**
      - **XLNet**
      - **T5 (Text-To-Text Transfer Transformer)**
      - **ALBERT (A Lite BERT)**
      - **BART (Bidirectional and Auto-Regressive Transformers)**
      - **DistilBERT**: A smaller, faster, and lighter version of BERT.
      - **ERNIE (Enhanced Representation through kNowledge Integration)**

11. **Multimodal Models**
    - Combines text with other data types like images.
    - Examples include:
      - **CLIP (Contrastive Language–Image Pre-training)**
      - **DALL-E**: Generates images from textual descriptions.

12. **Large Language Models (LLMs)**
    - Extensive pre-training on diverse datasets.
    - Examples include:
      - **GPT-3 and GPT-4**: Very large models with billions of parameters.
      - **PaLM (Pathways Language Model)**
      - **ChatGPT**: Fine-tuned versions of GPT models for conversational tasks.
      - **LLaMA (Large Language Model Meta AI)**

These models trace the evolution of NLP from simple word-count representations to deep learning architectures that can understand and generate human language across a wide range of tasks.