# Weekly Learning Program (01/June/24 - 07/June/24)

### Where to Solve the SQL Questions?

* I will provide the dataset along with the questions.
* You can use the following website to practice the SQL questions: [Programiz SQL Compiler](https://www.programiz.com/sql/online-compiler/)
* Please paste your solutions into the Jupyter notebook.

### Instructions
* Read the articles below (Dimension and BoW).
* Answer the questions.

## SQL QUESTIONS

#### 1. Given an Employee table containing the Id, Name, Salary, and ManagerId columns, write a SQL query to find the names of employees who earn more than their managers.

``` sql
CREATE TABLE Employee (
    Id INT PRIMARY KEY,
    Name NVARCHAR(50),
    Salary INT,
    ManagerId INT
);

INSERT INTO Employee (Id, Name, Salary, ManagerId) VALUES
(1, 'Alice', 50000, 3),
(2, 'Bob', 60000, 3),
(3, 'Charlie', 55000, NULL),
(4, 'David', 70000, 5),
(5, 'Eve', 65000, NULL),
(6, 'Frank', 72000, 4);


```

* Solution here
``` sql
WITH full_table AS (
    SELECT
        e.Name   AS emp,
        m.Name   AS manager,
        e.Salary AS emp_sal,
        m.Salary AS manager_sal
    FROM Employee AS e
    LEFT JOIN Employee AS m
        ON e.ManagerId = m.Id
)
SELECT emp
FROM full_table
WHERE manager_sal < emp_sal;
```
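One way to check a query like this inside the Jupyter notebook is Python's built-in `sqlite3` module. The following is a minimal sketch, assuming SQLite's dialect is close enough to the online compiler's; it uses an inner self-join, which is an equivalent rewrite for this data rather than the solution above verbatim.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employee (
    Id INT PRIMARY KEY,
    Name NVARCHAR(50),
    Salary INT,
    ManagerId INT
);
INSERT INTO Employee (Id, Name, Salary, ManagerId) VALUES
(1, 'Alice', 50000, 3),
(2, 'Bob', 60000, 3),
(3, 'Charlie', 55000, NULL),
(4, 'David', 70000, 5),
(5, 'Eve', 65000, NULL),
(6, 'Frank', 72000, 4);
""")

# Self-join: e is the employee, m is that employee's manager.
query = """
SELECT e.Name
FROM Employee AS e
JOIN Employee AS m ON e.ManagerId = m.Id
WHERE e.Salary > m.Salary
ORDER BY e.Id
"""
high_earners = [row[0] for row in conn.execute(query)]
print(high_earners)  # ['Bob', 'David', 'Frank']
conn.close()
```

Employees without a manager (`ManagerId` is `NULL`) never match the join condition, so they drop out of the result either way.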


#### 2. Given a `Weather` table with columns `Id` (integer), `RecordDate` (date), and `Temperature` (integer), write a SQL query to retrieve the `Id` of all dates where the temperature is higher than the temperature of the previous date.



```sql
CREATE TABLE Weather (
    Id INT PRIMARY KEY,
    RecordDate DATE,
    Temperature INT
);

INSERT INTO Weather (Id, RecordDate, Temperature) VALUES
(1, '2024-06-01', 23),
(2, '2024-06-02', 27),
(3, '2024-06-03', 24),
(4, '2024-06-04', 29),
(5, '2024-06-05', 26);
```


* Solution here
``` sql
SELECT Id
FROM (
    SELECT
        Id,
        Temperature,
        -- Order by date, not Id: "previous" refers to the previous RecordDate
        LAG(Temperature) OVER (ORDER BY RecordDate) AS prev_temp
    FROM Weather
) AS w
WHERE Temperature > prev_temp;
```
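A window-function solution compares each row with the row just before it, which matches "the previous date" only when the table has one row for every consecutive day. The sketch below, in plain Python with the sample rows copied by hand, shows the stricter reading: a row qualifies only if a record for the previous calendar day exists and is cooler. The variable names are illustrative.

```python
from datetime import date, timedelta

# (Id, RecordDate, Temperature), copied from the sample data
weather = [
    (1, date(2024, 6, 1), 23),
    (2, date(2024, 6, 2), 27),
    (3, date(2024, 6, 3), 24),
    (4, date(2024, 6, 4), 29),
    (5, date(2024, 6, 5), 26),
]

# Look up temperatures by date so gaps in the calendar are handled explicitly.
temp_by_date = {rec_date: temp for _, rec_date, temp in weather}

warmer_ids = []
for rec_id, rec_date, temp in weather:
    previous = temp_by_date.get(rec_date - timedelta(days=1))
    if previous is not None and temp > previous:
        warmer_ids.append(rec_id)

print(warmer_ids)  # [2, 4]
```

On the sample data both readings give the same answer, because the dates are consecutive.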

#### 3. Given an `Activity` table with columns `player_id` (integer), `device_id` (integer), `event_date` (date), and `games_played` (integer), where `(player_id, event_date)` is the primary key, write an SQL query to report for each player and date, the total number of games played by that player until that date.


```sql
CREATE TABLE Activity (
    player_id INT,
    device_id INT,
    event_date DATE,
    games_played INT,
    PRIMARY KEY (player_id, event_date)
);

INSERT INTO Activity (player_id, device_id, event_date, games_played) VALUES
(1, 2, '2016-03-01', 5),
(1, 2, '2016-05-02', 6),
(1, 3, '2017-06-25', 1),
(3, 1, '2016-03-02', 0),
(3, 4, '2018-07-03', 5);
```

#### Expected Output

```sql
+-----------+------------+---------------------+
| player_id | event_date | games_played_so_far |
+-----------+------------+---------------------+
| 1         | 2016-03-01 | 5                   |
| 1         | 2016-05-02 | 11                  |
| 1         | 2017-06-25 | 12                  |
| 3         | 2016-03-02 | 0                   |
| 3         | 2018-07-03 | 5                   |
+-----------+------------+---------------------+
```



* Solution here
``` sql
SELECT
    player_id,
    event_date,
    SUM(games_played) OVER (
        PARTITION BY player_id
        ORDER BY event_date
    ) AS games_played_so_far
FROM Activity
ORDER BY player_id, event_date;
```
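The running total can also be cross-checked outside SQL. This is a minimal sketch in plain Python, with the sample rows copied by hand and illustrative variable names; it groups rows by player and accumulates `games_played` in date order.

```python
from itertools import accumulate, groupby
from operator import itemgetter

# (player_id, device_id, event_date, games_played), copied from the sample data
activity = [
    (1, 2, "2016-03-01", 5),
    (1, 2, "2016-05-02", 6),
    (1, 3, "2017-06-25", 1),
    (3, 1, "2016-03-02", 0),
    (3, 4, "2018-07-03", 5),
]

# ISO-formatted date strings sort chronologically, so a plain sort is enough.
rows = sorted(activity, key=lambda r: (r[0], r[2]))

running_totals = []
for player_id, group in groupby(rows, key=itemgetter(0)):
    group = list(group)
    totals = accumulate(r[3] for r in group)  # cumulative sum within one player
    for row, total in zip(group, totals):
        running_totals.append((player_id, row[2], total))

for line in running_totals:
    print(line)
# (1, '2016-03-01', 5)
# (1, '2016-05-02', 11)
# (1, '2017-06-25', 12)
# (3, '2016-03-02', 0)
# (3, '2018-07-03', 5)
```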

#### What is a dimension in terms of ML?


In machine learning, the concept of a "dimension" can be understood through an analogy involving boxes. Imagine we have two boxes. As humans, we can tell them apart with our senses: by looking at them, weighing them, and touching them. A machine, however, needs numerical inputs to understand and differentiate objects.

Consider box 1 with dimensions (length, breadth, height) as (40, 30, 5) and box 2 with the same dimensions (40, 30, 5). For a machine learning model, these measurements are plotted on three axes (x, y, z), and both boxes will occupy the same point in this three-dimensional space. By measuring the distance between these points, the model will conclude that the boxes are identical.

Now, if we color box 1 black and box 2 white, humans can immediately perceive the difference. To enable a machine to differentiate based on color, we introduce another dimension, say 'c' for color. Suppose we assign numerical values to colors: 1 for black and -1 for white. Now, box 1 is represented as (40, 30, 5, 1) and box 2 as (40, 30, 5, -1). When plotted in this four-dimensional space, the points representing the boxes will no longer coincide. By measuring the distance between these points, the machine learning model can determine that the boxes are different based on the added dimension of color.
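The box analogy can be checked with a few lines of Python. This sketch assumes Euclidean distance as the way the model measures how far apart two points are; the variable names are illustrative.

```python
import math

# (length, breadth, height)
box1 = (40, 30, 5)
box2 = (40, 30, 5)
distance_3d = math.dist(box1, box2)
print(distance_3d)  # 0.0 -> the two points coincide

# Add a fourth dimension for color: 1 = black, -1 = white
box1_colored = (40, 30, 5, 1)
box2_colored = (40, 30, 5, -1)
distance_4d = math.dist(box1_colored, box2_colored)
print(distance_4d)  # 2.0 -> the points are now separated
```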

### In Terms of Text Fields

---

Let's imagine you have two sentences. We as humans can read and understand the sentences, identifying their meanings and differences through our cognitive abilities. However, a machine needs numerical representations to understand any text.

### Step 1: Initial Representation

Consider the following two sentences:

- Sentence 1: "The cat sat on the mat."
- Sentence 2: "The dog sat on the mat."

Initially, let's represent these sentences by counting the frequency of each word. This will form a simple Bag-of-Words (BoW) representation. Assume our vocabulary is constructed from both sentences, resulting in the following list of unique words (tokens):

```
["the", "cat", "sat", "on", "mat", "dog"]
```

We can now create a frequency vector for each sentence based on this vocabulary:

For Sentence 1:
```
[2, 1, 1, 1, 1, 0]
```
For Sentence 2:
```
[2, 0, 1, 1, 1, 1]
```

### Step 2: Dimension Representation

These vectors can be visualized as points in a six-dimensional space (since we have six unique words). The coordinates of each point correspond to the word frequencies in each sentence.
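The two frequency vectors can be reproduced and compared in code. This is a minimal sketch, assuming a simple lowercase, letters-only tokenizer and Euclidean distance; the helper names `tokenize` and `to_vector` are illustrative.

```python
import math
import re
from collections import Counter

sentences = ["The cat sat on the mat.", "The dog sat on the mat."]

def tokenize(text):
    # Lowercase and keep only runs of letters, which drops the final period.
    return re.findall(r"[a-z]+", text.lower())

# Vocabulary in order of first appearance across both sentences.
vocab = []
for sentence in sentences:
    for token in tokenize(sentence):
        if token not in vocab:
            vocab.append(token)

def to_vector(sentence):
    counts = Counter(tokenize(sentence))
    return [counts[word] for word in vocab]  # Counter returns 0 for missing words

vectors = [to_vector(s) for s in sentences]
distance = math.dist(vectors[0], vectors[1])

print(vocab)       # ['the', 'cat', 'sat', 'on', 'mat', 'dog']
print(vectors[0])  # [2, 1, 1, 1, 1, 0]
print(vectors[1])  # [2, 0, 1, 1, 1, 1]
print(distance)    # about 1.414: the points differ only on the "cat" and "dog" axes
```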

### Step 3: Adding More Dimensions

Now, let's extend our representation to capture more nuanced differences. Suppose we want to include information about the sentiment of the sentences. Assume we have a sentiment analysis model that assigns a sentiment score to each sentence: +1 for positive, 0 for neutral, and -1 for negative.

- Sentence 1 (Neutral): Sentiment score 0
- Sentence 2 (Neutral): Sentiment score 0

Now our vectors become:

For Sentence 1:
```
[2, 1, 1, 1, 1, 0, 0]
```
For Sentence 2:
```
[2, 0, 1, 1, 1, 1, 0]
```

Next, let's add another dimension for sentence length. Assume the length of the sentence (number of words) is another feature:

- Sentence 1: Length 6
- Sentence 2: Length 6

Updating our vectors:

For Sentence 1:
```
[2, 1, 1, 1, 1, 0, 0, 6]
```
For Sentence 2:
```
[2, 0, 1, 1, 1, 1, 0, 6]
```
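The effect of the extra features on the distance between the two points can be checked directly. In this sketch the helper `add_features` is illustrative, and the negative sentiment score in the last step is a hypothetical value used only to show what happens when a feature differs.

```python
import math

bow_1 = [2, 1, 1, 1, 1, 0]
bow_2 = [2, 0, 1, 1, 1, 1]

def add_features(bow, sentiment, length):
    # Append the extra dimensions to the end of the word-count vector.
    return bow + [sentiment, length]

full_1 = add_features(bow_1, sentiment=0, length=6)
full_2 = add_features(bow_2, sentiment=0, length=6)

distance_bow = math.dist(bow_1, bow_2)
distance_full = math.dist(full_1, full_2)
# Identical feature values add nothing to the distance.
print(math.isclose(distance_bow, distance_full))  # True

# Hypothetical: suppose sentence 2 had been scored as negative (-1).
full_2_negative = add_features(bow_2, sentiment=-1, length=6)
distance_negative = math.dist(full_1, full_2_negative)
print(distance_negative > distance_full)  # True
```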

### Conclusion

By adding more features, we increase the number of dimensions in which the sentences are represented. With word frequencies alone, the two sentences differ only in the "cat" and "dog" positions. In this example the added features (sentiment score and sentence length) happen to have the same values for both sentences, so they do not separate these two any further; for sentences that differ in sentiment or length, however, the extra dimensions would move the points apart. A richer representation of this kind helps a machine learning model understand and differentiate sentences more effectively.

---


## PANDAS AND NUMPY

#### Bag of Words (BoW) is the simplest way to convert text data into numbers (vectors) that a machine can understand.

* A sentence is called a document.
* Every word is a token.
* A set of sentences is a corpus.


Bag of Words (BoW) is a technique used in natural language processing (NLP) to convert text data into numerical representations for machine learning models. In BoW, a text is represented as a bag (multiset) of its words, disregarding grammar and word order but keeping multiplicity.

### How BoW Works:
1. **Tokenization:** Split the text into words (tokens).
2. **Vocabulary Creation:** Create a list of unique words (vocabulary) from all documents.
3. **Vectorization:** Convert each document into a vector based on the vocabulary, with each element representing the count of a specific word in that document.

### Example:
Consider the following two sentences:
1. "I love dogs"
2. "I love cats"

#### Step-by-Step:
1. **Tokenization:**
   - Sentence 1: ["I", "love", "dogs"]
   - Sentence 2: ["I", "love", "cats"]

2. **Vocabulary Creation:**
   - ["I", "love", "dogs", "cats"]

3. **Vectorization:**
   - Sentence 1: [1, 1, 1, 0] (1 "I", 1 "love", 1 "dogs", 0 "cats")
   - Sentence 2: [1, 1, 0, 1] (1 "I", 1 "love", 0 "dogs", 1 "cats")

The resulting vectors represent the text in a numerical format that can be used for further processing in machine learning models.
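The three steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization and a vocabulary ordered by first appearance; the function names are illustrative.

```python
def tokenize(text):
    # Step 1: split the text into tokens on whitespace.
    return text.split()

def build_vocab(documents):
    # Step 2: collect unique tokens, keeping the order of first appearance.
    vocab = []
    for doc in documents:
        for token in tokenize(doc):
            if token not in vocab:
                vocab.append(token)
    return vocab

def vectorize(doc, vocab):
    # Step 3: count how often each vocabulary word occurs in the document.
    tokens = tokenize(doc)
    return [tokens.count(word) for word in vocab]

documents = ["I love dogs", "I love cats"]
vocab = build_vocab(documents)
vectors = [vectorize(doc, vocab) for doc in documents]

print(vocab)    # ['I', 'love', 'dogs', 'cats']
print(vectors)  # [[1, 1, 1, 0], [1, 1, 0, 1]]
```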

### Summary:
BoW converts text into a fixed-size numerical vector based on word counts, enabling machine learning algorithms to process text data. It is simple but ignores grammar and context.

### Limitations of BoW:

1. **Ignores Word Order:** Loses context and meaning by treating words independently.
2. **No Semantics:** Fails to capture the meaning or relationships between words.
3. **Sparse Representations:** Results in large, mostly empty vectors.
4. **Lack of Generalization:** Cannot handle unseen words effectively.
5. **Fixed Vocabulary:** Limited by the training corpus, missing new or rare words.
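Two of these limitations are easy to see in code. The sketch below uses hypothetical example sentences and a hand-picked vocabulary: reordering the words does not change the vector, and a word outside the vocabulary is silently dropped.

```python
vocab = ["dog", "bites", "man"]  # fixed vocabulary, assumed for illustration

def vectorize(text, vocab):
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocab]

# Word order is ignored: opposite meanings, identical vectors.
vec_a = vectorize("dog bites man", vocab)
vec_b = vectorize("man bites dog", vocab)
print(vec_a, vec_b, vec_a == vec_b)  # [1, 1, 1] [1, 1, 1] True

# Unseen words are lost: "cat" is not in the vocabulary, so it leaves no trace.
vec_c = vectorize("dog bites cat", vocab)
print(vec_c)  # [1, 1, 0]
```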

### Why More Advanced Models are Needed:

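To address these issues, techniques such as TF-IDF weighting, word embeddings (Word2Vec, GloVe), and Transformer-based models (BERT, GPT) are used. TF-IDF still ignores word order but down-weights words that appear in most documents; embeddings and Transformer-based models add semantic representation and context understanding, and handle large vocabularies more efficiently.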
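As a small illustration of the first of these, TF-IDF can be computed by hand on the "I love dogs" / "I love cats" example. This sketch assumes the plain textbook formula, tf × log(N / df); libraries typically use smoothed variants, so their numbers will differ.

```python
import math

# Tokenized, lowercased documents from the earlier example
docs = [["i", "love", "dogs"], ["i", "love", "cats"]]
vocab = ["i", "love", "dogs", "cats"]
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
doc_freq = {word: sum(word in doc for doc in docs) for word in vocab}

# Inverse document frequency (unsmoothed): words found in every document get 0.
idf = {word: math.log(n_docs / doc_freq[word]) for word in vocab}

# TF-IDF vector: raw term count multiplied by the word's idf.
tfidf = [[doc.count(word) * idf[word] for word in vocab] for doc in docs]

for row in tfidf:
    print([round(value, 3) for value in row])
# [0.0, 0.0, 0.693, 0.0]
# [0.0, 0.0, 0.0, 0.693]
```

The shared words "i" and "love" receive zero weight, so only the distinguishing words "dogs" and "cats" remain in the vectors.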

#### BOW from Scratch

```python
# Let's create a simple BoW model without using libraries.
# Use ChatGPT for hints.
# Create three different functions:
#   one for cleaning (removing punctuation, HTML tags, and non-word characters),
#   one for creating the vocabulary,
#   one for representing each sentence as a vector.
# Use apply and lambda functions wherever applicable.

import pandas as pd
import re


data = {
    "Sentences": [
        "HELLO world! This is a <div>sample</div> sentence.",
        "Learning PYTHON is fun. <p>Practice</p> makes perfect.",
        "Data Science with <a href='#'>pandas</a> is powerful!",
        "Machine learning involves <span>algorithms</span> and data.",
        "Clean your DATA: Remove <b>noise</b> and punctuations."
    ]
}

# Creating the DataFrame
df = pd.DataFrame(data)


def cleaning_lower(string):
    string = re.sub(r'<.*?>', ' ', string)      # replace HTML tags with a space
    string = re.sub(r'[^A-Za-z]', ' ', string)  # replace anything that is not a letter
    string = re.sub(r'\s+', ' ', string)        # collapse repeated whitespace
    string = string.lower()
    return string


df['Sentences'] = df['Sentences'].apply(lambda x: cleaning_lower(x))
print('-----This is the cleaned text--------')
print(df)
print('\n\n\n')


def create_vocab(df, column_name):
    tokens_list = df[column_name].apply(lambda x: x.split())
    all_tokens = tokens_list.sum()  # Flatten the list of lists
    vocab = list(set(all_tokens))   # Get unique words (order is not guaranteed)
    return vocab


vocab = create_vocab(df, 'Sentences')

print('----This is the vocab------')
print(vocab)
print('\n\n')


def create_vector(sentence, vocab):
    bow = [0] * len(vocab)  # one counter per vocabulary word
    for word in sentence.split():
        if word in vocab:
            index = vocab.index(word)
            bow[index] += 1
    return bow


df['vector_rep'] = df['Sentences'].apply(lambda x: create_vector(x, vocab))
print(df)
```

-----This is the cleaned text--------
                                        Sentences
0          hello world this is a sample sentence 
1  learning python is fun practice makes perfect 
2           data science with pandas is powerful 
3  machine learning involves algorithms and data 
4  clean your data remove noise and punctuations 




----This is the vocab------
['noise', 'learning', 'practice', 'machine', 'science', 'makes', 'and', 'powerful', 'remove', 'python', 'a', 'perfect', 'your', 'clean', 'sample', 'punctuations', 'hello', 'world', 'involves', 'with', 'fun', 'data', 'sentence', 'pandas', 'is', 'algorithms', 'this']



                                        Sentences  \
0          hello world this is a sample sentence    
1  learning python is fun practice makes perfect    
2           data science with pandas is powerful    
3  machine learning involves algorithms and data    
4  clean your data remove noise and punctuations    

                                          