# Python for Data Science - Data Handling

<h2>Table of Contents</h2>
<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ul>
    <li>
    <a href="#data">Working with Data</a>
        <ul>
        <li>
            <a href="#read">Reading Files</a>
                <ul>
                <li><a href="#open">Open</a></li>
                </ul>
        </li>
        <li>
            <a href="#write">Writing Files</a>
                <ul>
                <li><a href="#append">Appending Files</a></li>
                </ul>
        </li>
        <li>
            <a href="#pd">Pandas</a>
        </li>
        <li>
            <a href="#np">NumPy</a>
                <ul>
                <li><a href="#ndarray">NumPy Array</a></li>
                <li><a href="#vec">Vectorized Computation</a></li>
                <li><a href="#dot">Dot Product (Matrix Multiplication)</a></li>
                </ul>
        </li>
        </ul>
    </li>
    <li>
    <a href="#collect">Data Collection</a>
        <ul>
        <li><a href="#api">API</a></li>
        <li><a href="#rest">REST APIs</a></li>
        <li><a href="#requests">Requests</a></li>
        <li><a href="#web">Web Scraping</a></li>
        </ul>
    </li>
    </ul>
</div>

<hr>

<a id="data"></a>
## **Working with data**

<a id="read"></a>
### Reading Files

Reading files with Python is straightforward and can be done in several ways depending on the file type and your needs.


<a id="open"></a>
#### Open

We can use the built-in function <code>open('path/filename.txt', mode)</code>. The <code>mode</code> argument is optional and defaults to <code>'r'</code>.

The following table provides examples of different file modes.

|Mode|Meaning|Description|
|----|----------|-------|
|'r'|Read (text)|Default mode. Opens file for reading. File must exist|
|'w'|Write (text)|Opens file for writing. Overwrites file if it exists, or creates a new one|
|'a'|Append (text)|Opens file for writing. Appends to the end if the file exists, or creates a new one|
|'r+'|Read and write (text)|Opens file for both reading and writing. File must exist|
|'rb'|Read (binary)|Opens file in binary read mode. Useful for images, videos, etc|
|'wb'|Write (binary)|Opens file in binary write mode. Overwrites or creates new file|
|'ab'|Append (binary)|Opens binary file for appending|
|'rb+'|Read and write (binary)|Opens binary file for both reading and writing. File must exist.|



Notes:

- **Text mode** reads/writes strings
- **Binary mode** reads/writes bytes (b'')
- Use **newline=''** when reading/writing CSV files to avoid issues with line breaks (especially on Windows).
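
The notes above can be checked with a short sketch. It is a minimal illustration, not part of the original notebook: it writes to a temporary directory, and the file names `demo.txt` and `demo.csv` are made up for the example.

```python
import csv
import tempfile
from pathlib import Path

# Work in a throwaway directory so nothing on disk is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    txt_path = Path(tmp) / "demo.txt"  # hypothetical file name
    csv_path = Path(tmp) / "demo.csv"  # hypothetical file name

    # Text mode: write() takes str and read() returns str.
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write("café\n")
    with open(txt_path, "r", encoding="utf-8") as f:
        as_text = f.read()

    # Binary mode: read() returns the raw bytes stored on disk.
    with open(txt_path, "rb") as f:
        as_bytes = f.read()

    # CSV: newline='' lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows([["name", "age"], ["Alice", 25]])
    with open(csv_path, "r", newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))

print(type(as_text).__name__, type(as_bytes).__name__)
print(rows)
```

Note that `csv.reader` returns every field as a string, so the age comes back as `'25'` rather than the integer that was written.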

In [4]:
import requests

# Read from remote file
url = "https://www.gutenberg.org/files/11/11-0.txt"
response = requests.get(url)

if response.status_code == 200: # Check if the request was successful
    text = response.text # text content of the file
    print("First 500 characters:")
    print(text[:500])  # Print first 500 characters
else:
    print("Failed to retrieve file.")

First 500 characters:
*** START OF THE PROJECT GUTENBERG EBOOK 11 ***

[Illustration]




Alice’s Adventures in Wonderland

by Lewis Carroll

THE MILLENNIUM FULCRUM EDITION 3.0

Contents

 CHAPTER I.     Down the Rabbit-Hole
 CHAPTER II.    The Pool of Tears
 CHAPTER III.   A Caucus-Race and a Long Tale
 CHAPTER IV.    The Rabbit Sends in a Little Bill
 CHAPTER V.     Advice from a Caterpillar
 CHAPTER VI.    Pig and Pepper
 CHAPTER VII.   A Mad Tea-Party
 CHAPTER VIII.  The Queen’s Croquet-Ground
 CHAPTER IX.    The


In [5]:
# Write to a local file alice.txt
with open("data/alice.txt", "w", encoding="utf-8") as file:
    file.write(text) # write the content to the file

print("Content saved to alice.txt")

Content saved to alice.txt


In [6]:
# Read from local file
with open("data/alice.txt", "r", encoding="utf-8") as file:
    content = file.read(100)
    print("First 100 characters from local file:")
    print(content)

    # read last line from the file
    print("Last line from local file:")
    for line in file:
        pass
    print(line.strip())  # Read and print the last line

    # Read all lines and save them in a list
    file.seek(0)  # Reset file pointer to the beginning
    lines = file.readlines()
    print("Total number of lines in alice.txt:", len(lines))
    # Print 500 to 505 lines from the file
    print("Lines 500 to 505 from local file:")
    for i in range(500, 506):
        print(lines[i].strip())

    # Print total number of characters in the file
    file.seek(0)
    print("Total number of characters in alice.txt:", len(file.read()))  # read() from position 0 returns the whole file


First 100 characters from local file:
*** START OF THE PROJECT GUTENBERG EBOOK 11 ***

[Illustration]




Alice’s Adventures in Wonderland
Last line from local file:
*** END OF THE PROJECT GUTENBERG EBOOK 11 ***
Total number of lines in alice.txt: 3384
Lines 500 to 505 from local file:

“I know what ‘it’ means well enough, when _I_ find a thing,” said the
Duck: “it’s generally a frog or a worm. The question is, what did the
archbishop find?”

The Mouse did not notice this question, but hurriedly went on, “‘—found
Total number of characters in alice.txt: 144696


In [7]:
print(file.name)
print(file.mode)
print(file.closed) # Check if the file is closed  

data/alice.txt
r
True


##### 🔍 `with open(...)` vs `file = open(...)` in Python

| Feature            | `with open(...) as file:`                      | `file = open(...)`                      |
|--------------------|------------------------------------------------|-----------------------------------------|
| File closing       | Automatically closed after the block        | Must call `file.close()` manually     |
| Error handling     | Safer: the file is closed even if an exception is raised | Risk of file staying open on error    |
| Recommended        | Yes – Pythonic and clean                    | Not recommended for general use       |
| Syntax             | Concise and structured                      | More verbose and error-prone          |
| Best for           | Reading/writing safely                      | Quick one-time scripts or tests       |

##### Example 
```python
with open("example.txt", "r") as file:
    content = file.read()
# file is automatically closed here

file = open("example.txt", "r")
content = file.read()
file.close()  # You must remember to close it!
```
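
To see why the `with` form is safer, the sketch below raises an error inside the block and then checks whether the file was closed. It is an illustration under assumptions: the temporary file and the simulated `ValueError` are invented for the demonstration.

```python
import os
import tempfile

# Create a small throwaway file to read from.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as tmp:
    tmp.write("hello\n")

try:
    with open(path, "r") as file:
        first = file.readline()
        raise ValueError("simulated failure inside the block")
except ValueError:
    pass

closed_after_error = file.closed  # the with block closed the file despite the error
print(closed_after_error)

# The manual form needs try/finally to give the same guarantee.
file2 = open(path, "r")
try:
    content = file2.read()
finally:
    file2.close()

os.remove(path)
```

The `with` statement does not suppress the exception; it only guarantees that the file is closed before the exception propagates.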

<a id="write"></a>
### Writing Files

In Python, you can write files using the built-in <code>open()</code> function in combination with the <code>write()</code> or <code>writelines()</code> methods.

- <code>'w'</code> mode creates the file if it doesn’t exist, and overwrites the existing content if it does.

In [8]:
with open("data/example.txt", "w") as filewrite:
    filewrite.write("Hello, world!\n")
    filewrite.write("This is a second line.\n")
    filewrite.write("This is a third line.\n")
    print("Content written to example.txt")
# Read from example.txt
with open("data/example.txt", "r") as filewrite:
    print("Content of example.txt:")
    for line in filewrite:
        print(line.strip())  # Print each line without extra newlines

Content written to example.txt
Content of example.txt:
Hello, world!
This is a second line.
This is a third line.


Opening the file in **w** mode and then reopening it in **r** mode just to read the lines back is fairly inefficient.

- <code>'w+'</code> mode lets you write and read in the same block. It truncates the file on opening, and you must call <code>seek(0)</code> before reading from the beginning.

In [9]:
with open("data/example.txt", "w+") as filewrite:
    filewrite.write("Hello, world!\n")
    filewrite.write("This is a second line.\n")
    filewrite.write("This is a third line.\n")
    filewrite.seek(0)  # Reset file pointer to the beginning
    print("Content of example.txt:")
    for line in filewrite:
        print(line.strip())  # Print each line without extra newlines


Content of example.txt:
Hello, world!
This is a second line.
This is a third line.


- <code>'r+'</code> mode reads and writes starting from the beginning. The file must exist, and it is not truncated.

In [10]:
# Read and write in 'r+' mode
with open("data/example.txt", "r+") as filewrite:
    content = filewrite.read()
    print("Content of example.txt in read mode:")
    print(content.strip())  # Print the entire content without extra newlines
    filewrite.write("\nThis is a new line added in read+ mode.\n")  # Write at the end
    filewrite.seek(0)  # Reset file pointer to the beginning
    print("Content of example.txt after writing in read+ mode:")
    for line in filewrite:
        print(line.strip())  # Print each line without extra newlines

Content of example.txt in read mode:
Hello, world!
This is a second line.
This is a third line.
Content of example.txt after writing in read+ mode:
Hello, world!
This is a second line.
This is a third line.

This is a new line added in read+ mode.


<a id="append"></a>
#### Appending Files

- <code>'a'</code> mode adds content to the end of the file without losing any of the existing data.
- <code>'a+'</code> mode allows you to append and read, and keeps the file intact. You must call <code>seek(0)</code> to read from the beginning.

In [11]:
with open("data/example.txt", "a") as filewrite:
    filewrite.write("This is an appended line.\n")
    print("Content appended to example.txt")
# Read from example.txt after appending
with open("data/example.txt", "r") as filewrite:
    print("Content of example.txt after appending:")
    for line in filewrite:
        print(line.strip())  # Print each line without extra newlines

Content appended to example.txt
Content of example.txt after appending:
Hello, world!
This is a second line.
This is a third line.

This is a new line added in read+ mode.
This is an appended line.


In [12]:
with open("data/example.txt", "a+") as filewrite:
    filewrite.write("This is another appended line.\n")
    filewrite.seek(0)  # Reset file pointer to the beginning
    print("Content of example.txt after another append:")
    for line in filewrite:
        print(line.strip())  # Print each line without extra newlines

Content of example.txt after another append:
Hello, world!
This is a second line.
This is a third line.

This is a new line added in read+ mode.
This is an appended line.
This is another appended line.


The following table shows the difference among **r+**, **w+**, and **a+**.


| Mode | Read | Write | File Must Exist | Truncates File | Write Starts At | Can Overwrite | Can Append |
|------|------|-------|------------------|----------------|------------------|----------------|-------------|
| `r+` | ✅    | ✅     | ✅ Yes           | ❌ No           | 🔼 Beginning (seekable) | ✅ Yes         | ✅ Yes (with `seek(0,2)`) |
| `w+` | ✅    | ✅     | ❌ No            | ✅ Yes          | 🔼 Beginning         | ✅ Yes         | ❌ No (the file is truncated to zero length when opened) |
| `a+` | ✅    | ✅     | ❌ No            | ❌ No           | 🔽 End only          | ❌ No          | ✅ Yes       |

---

##### 🧪 Behavior Examples

##### File content before:

```
1234567890
```

##### `r+` – Read & Write from beginning (no truncation)

```python
with open("example.txt", "r+") as f:
    f.write("abc")
```

**Result:**

```
abc4567890
```

---

##### `w+` – Truncate & Write from scratch

```python
with open("example.txt", "w+") as f:
    f.write("xyz")
```

**Result:**

```
xyz
```

---

##### `a+` – Always Append

```python
with open("example.txt", "a+") as f:
    f.write("+++")
```

**Result:**

```
1234567890+++
```

Even if you use `f.seek(0)`, writing still goes to the end.
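
The three behaviors above can be reproduced in one runnable sketch. The helper `result_after` is hypothetical (it is not part of any library); it recreates the `1234567890` file in a temporary location, writes with the given mode, and returns the final content. The append result reflects how append mode behaves on common platforms.

```python
import os
import tempfile

def result_after(mode, text, seek_to_start=False):
    """Write `text` to a file that initially holds '1234567890'; return the final content."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as f:
        f.write("1234567890")
    with open(path, mode) as f:
        if seek_to_start:
            f.seek(0)  # has no effect on where 'a+' writes land
        f.write(text)
    with open(path, "r") as f:
        final = f.read()
    os.remove(path)
    return final

print(result_after("r+", "abc"))                      # abc4567890
print(result_after("w+", "xyz"))                      # xyz
print(result_after("a+", "+++"))                      # 1234567890+++
print(result_after("a+", "+++", seek_to_start=True))  # 1234567890+++
```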

---




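Every open file object keeps an internal file pointer that marks where the next read or write will occur.

You can move this pointer with <code>seek(offset, whence)</code>, which sets the position to <code>offset</code> relative to <code>whence</code>. In binary mode the offset is counted in bytes. In text mode, the only supported calls are seeks to offset 0 (with any <code>whence</code>) and, with <code>whence=0</code>, seeks to an offset previously returned by <code>tell()</code>.

<code>whence</code> can be 0, 1, or 2, corresponding to the beginning (the default), the current position, and the end of the file.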
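
A minimal sketch of the three `whence` values follows. It opens a temporary file in binary mode, where offsets are plain byte counts; the file content `1234567890` is chosen only for illustration.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"1234567890")

with open(path, "rb") as f:
    f.seek(3)            # whence defaults to 0: 3 bytes from the beginning
    a = f.read(2)        # b"45"
    f.seek(2, 1)         # whence=1: skip 2 bytes forward from the current position
    b = f.read(1)        # b"8"
    f.seek(-3, 2)        # whence=2: 3 bytes before the end
    c = f.read()         # b"890"
    position = f.tell()  # 10, the end of the file

os.remove(path)
print(a, b, c, position)
```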

##### Appending examples using r+:




In [13]:
with open("data/append_example.txt", "w") as fileappend:
    fileappend.write("This is a new file for appending.\n")
    print("Content written to append_example.txt")

Content written to append_example.txt


In [14]:
# Append to append_example.txt using 'r+' mode
with open("data/append_example.txt", "r+") as fileappend:
    fileappend.seek(0, 2)  # Move to the end of the file
    fileappend.write("This is an appended line in r+ mode.\n")
    fileappend.seek(0)  # Reset file pointer to the beginning
    print("Content of append_example.txt after appending:")
    for line in fileappend:
        print(line.strip())  # Print each line without extra newlines

Content of append_example.txt after appending:
This is a new file for appending.
This is an appended line in r+ mode.


In [15]:
# 'w+' mode truncates the file to zero length before writing
with open("data/append_example.txt", "w+") as fileappend:
    fileappend.seek(0, 2)
    fileappend.write("This is a new line in w+ mode.\n") # Only this line will be written
    # The previous content is lost due to w+ mode
    fileappend.seek(0)  # Reset file pointer to the beginning
    print("Content of append_example.txt after writing in w+ mode:")
    for line in fileappend:
        print(line.strip())  

Content of append_example.txt after writing in w+ mode:
This is a new line in w+ mode.


<a id="pd"></a>
### Pandas


Pandas provides two main data structures for manipulating data:

- DataFrame: a two-dimensional data structure with rows and columns
- Series: a one-dimensional array of indexed data, such as a single column of a DataFrame
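
The relationship between the two structures can be sketched as follows. The small frame `df_demo` is illustrative data, not taken from the notebook.

```python
import pandas as pd

df_demo = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})  # illustrative data

ages = df_demo["Age"]        # a single column is a Series
first_row = df_demo.iloc[0]  # a single row is also returned as a Series

print(type(df_demo).__name__, df_demo.shape)  # DataFrame (2, 2)
print(type(ages).__name__, ages.shape)        # Series (2,)
print(list(ages.index), ages.tolist())        # [0, 1] [25, 30]
```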



In [18]:
import pandas as pd

In [78]:
# create a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace'],
    'Age': [25, 30, 35, 40, None, 45, 50],
    'City': ['NY', 'LA', 'NY', 'Chicago', 'LA', None, 'Chicago'],
    'Salary': [70000, 80000, None, 90000, 75000, 60000, 50000],
    'Is_Employed': [True, False, True, True, False, True, True],
    'Join_Date': pd.to_datetime(['2020-01-01', '2019-05-15', '2021-07-20', '2018-03-30', '2022-11-01', '2017-12-25', '2023-01-01'])
}
df = pd.DataFrame(data)

# Save DataFrame to a CSV file
df.to_csv('data/example_dataframe.csv', index=False)
# Read the CSV file into a DataFrame
df_read = pd.read_csv('data/example_dataframe.csv')
print("DataFrame read from CSV:")
df_read

DataFrame read from CSV:


| |Name|Age|City|Salary|Is_Employed|Join_Date|
|---|---|---|---|---|---|---|
|0|Alice|25.0|NY|70000.0|True|2020-01-01|
|1|Bob|30.0|LA|80000.0|False|2019-05-15|
|2|Charlie|35.0|NY|NaN|True|2021-07-20|
|3|David|40.0|Chicago|90000.0|True|2018-03-30|
|4|Eva|NaN|LA|75000.0|False|2022-11-01|
|5|Frank|45.0|NaN|60000.0|True|2017-12-25|
|6|Grace|50.0|Chicago|50000.0|True|2023-01-01|


##### Exploring data

```python
df.head()         # First 5 rows
df.tail()         # Last 5 rows
df.shape          # (rows, columns)
df.info()         # Summary info
df.describe()     # Statistics summary
df.columns        # Column names
df.dtypes         # Data types
df.isnull().sum() # Count missing values per column
```

In [None]:
# Example
df_read.isnull().sum()  # Check for missing values

Name           0
Age            1
City           1
Salary         1
Is_Employed    0
Join_Date      0
dtype: int64

##### Data Selection

```python
df['Age']                # Select column (as series)
df[['Name', 'Age']]      # Select multiple columns (as dataframe)
df.loc[0]                # Row by index label
df.iloc[0]               # Row by integer position
df.loc[0, 'Age']         # Specific cell
```
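
With the default integer index, `loc[0]` and `iloc[0]` return the same row, which hides the difference between them. The sketch below uses string labels to make it visible; the frame `people` is illustrative data.

```python
import pandas as pd

# Illustrative frame with string labels as the index.
people = pd.DataFrame(
    {"Age": [25, 30, 35]},
    index=["alice", "bob", "charlie"],
)

by_label = people.loc["bob", "Age"]      # label-based lookup
by_position = people.iloc[1, 0]          # position-based: second row, first column
label_slice = people.loc["alice":"bob"]  # label slices include the end label
pos_slice = people.iloc[0:1]             # position slices exclude the end position

print(by_label, by_position, len(label_slice), len(pos_slice))
```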

##### Filtering

```python
df[df['Age'] > 25]                  # Filter rows
df[(df['Age'] > 25) & (df['Is_Employed'] == True)]  # Multiple conditions
```

##### Modifying data

```python
df['Gender'] = ['F', 'M', 'M', 'M', 'F', 'M', 'F']  # Add new column (one value per row)
df['Age'] = df['Age'] + 1      # Modify column
df.drop('Gender', axis=1, inplace=True)  # Drop column
```

- axis=0: delete rows (default)
- axis=1: delete columns
- inplace=True: modify the original dataframe
- inplace=False: return a new dataframe (default)
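
The effect of `axis` and `inplace` can be sketched on a small frame. `df_demo` below is illustrative data, not the notebook's DataFrame.

```python
import pandas as pd

df_demo = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30], "City": ["NY", "LA"]})

without_city = df_demo.drop("City", axis=1)           # new DataFrame, df_demo is unchanged
without_row0 = df_demo.drop(0, axis=0)                # drops the row labelled 0
returned = df_demo.drop("Age", axis=1, inplace=True)  # modifies df_demo, returns None

print(list(without_city.columns))       # ['Name', 'Age']
print(list(without_row0.index))         # [1]
print(list(df_demo.columns), returned)  # ['Name', 'City'] None
```

Because `inplace=True` returns `None`, assigning its result back to the same variable would replace the DataFrame with `None`.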



In [80]:
# modify the DataFrame
# add a new column
df_read['Experience'] = [2, 5, 3, 10, 1, 8, 4]  # Adding years of experience
# modify an existing column
df_read['Salary'] = df_read['Salary'].fillna(0)  # Fill missing salaries with 0
# drop a column
df_read.drop(columns=['Is_Employed'])  # Returns a new DataFrame without 'Is_Employed'; df_read itself is unchanged

| |Name|Age|City|Salary|Join_Date|Experience|
|---|---|---|---|---|---|---|
|0|Alice|25.0|NY|70000.0|2020-01-01|2|
|1|Bob|30.0|LA|80000.0|2019-05-15|5|
|2|Charlie|35.0|NY|0.0|2021-07-20|3|
|3|David|40.0|Chicago|90000.0|2018-03-30|10|
|4|Eva|NaN|LA|75000.0|2022-11-01|1|
|5|Frank|45.0|NaN|60000.0|2017-12-25|8|
|6|Grace|50.0|Chicago|50000.0|2023-01-01|4|


In [56]:
df_read

| |Name|Age|City|Salary|Is_Employed|Join_Date|Experience|
|---|---|---|---|---|---|---|---|
|0|Alice|25.0|NY|70000.0|True|2020-01-01|2|
|1|Bob|30.0|LA|80000.0|False|2019-05-15|5|
|2|Charlie|35.0|NY|0.0|True|2021-07-20|3|
|3|David|40.0|Chicago|90000.0|True|2018-03-30|10|
|4|Eva|NaN|LA|75000.0|False|2022-11-01|1|
|5|Frank|45.0|NaN|60000.0|True|2017-12-25|8|
|6|Grace|50.0|Chicago|50000.0|True|2023-01-01|4|


##### Handling Missing Data

```python
df.dropna()              # Drop rows with missing values
df.fillna(0)             # Fill missing with 0
df['Age'].fillna(df['Age'].mean())  # Fill with mean
```
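
None of the calls above change `df` by themselves: each returns a new object that has to be assigned back. A minimal sketch on illustrative data (`df_demo` is not the notebook's DataFrame):

```python
import pandas as pd

df_demo = pd.DataFrame({"Age": [20.0, None, 40.0], "City": ["NY", "LA", None]})

dropped = df_demo.dropna()  # keeps only rows with no missing values
age_filled = df_demo["Age"].fillna(df_demo["Age"].mean())  # mean skips NaN: (20 + 40) / 2 = 30
df_demo["Age"] = age_filled  # assign back to keep the change

print(len(dropped))                    # 1
print(age_filled.tolist())             # [20.0, 30.0, 40.0]
print(int(df_demo["City"].isna().sum()))  # 1, City was not filled
```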

##### Sorting Data

<code>sort_values()</code> sorts rows by one or more columns. Missing values are placed last by default.

In [57]:
# sort the DataFrame by 'Age'
df_sorted = df_read.sort_values(by='Age', ascending=True)
df_sorted

| |Name|Age|City|Salary|Is_Employed|Join_Date|Experience|
|---|---|---|---|---|---|---|---|
|0|Alice|25.0|NY|70000.0|True|2020-01-01|2|
|1|Bob|30.0|LA|80000.0|False|2019-05-15|5|
|2|Charlie|35.0|NY|0.0|True|2021-07-20|3|
|3|David|40.0|Chicago|90000.0|True|2018-03-30|10|
|5|Frank|45.0|NaN|60000.0|True|2017-12-25|8|
|6|Grace|50.0|Chicago|50000.0|True|2023-01-01|4|
|4|Eva|NaN|LA|75000.0|False|2022-11-01|1|


##### Grouping and Aggregating

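<code>groupby(column)[columns]</code> followed by an aggregation such as <code>mean()</code>: one aggregation applied to the selected columns

<code>groupby(column).agg({})</code>: one or more aggregations per column, given as a dictionary that maps column names to aggregation names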

In [62]:
# group by 'City' and calculate the average salary and age
df_grouped = df_read.groupby('City')[['Salary','Age']].mean().reset_index()
df_grouped

| |City|Salary|Age|
|---|---|---|---|
|0|Chicago|70000.0|45.0|
|1|LA|77500.0|30.0|
|2|NY|35000.0|30.0|


In [61]:
# multiple aggregations
df_agg = df_read.groupby('City').agg({
    'Salary': ['median', 'sum'],
    'Age': ['min', 'max']
}).reset_index()
df_agg

| |City|Salary (median)|Salary (sum)|Age (min)|Age (max)|
|---|---|---|---|---|---|
|0|Chicago|70000.0|140000.0|40.0|50.0|
|1|LA|77500.0|155000.0|30.0|30.0|
|2|NY|35000.0|70000.0|25.0|35.0|


##### Merging / Concatenation

- Merging (similar to a SQL join): <code>pd.merge(left, right, how='', on='', left_on='', right_on='')</code>
  
|Parameter|Description|
|---------|-----------|
|left|The first DataFrame (on the left side of the merge)|
|right|The second DataFrame (on the right side of the merge)|
|on|Column name(s) present in both DataFrames to merge on (same column name in both)|
|left_on|Column name(s) in the left DataFrame to use as merge key|
|right_on|Column name(s) in the right DataFrame to use as merge key|
|how|Type of merge: 'inner', 'left', 'right', 'outer' (explained below)|

- Same column name in both tables → use on
- Different column names → use left_on and right_on

|how|Description|
|---|---|
|'inner'|Only matching rows in both DataFrames|
|'left'|All rows from left DataFrame + matches from right|
|'right'|All rows from right DataFrame + matches from left|
|'outer'|All rows from both DataFrames|
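
When the key columns have different names, `left_on` and `right_on` are used instead of `on`. The sketch below is illustrative only: the frames `employees` and `offices` and the column name `Employee` are made up for the example.

```python
import pandas as pd

employees = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]})
offices = pd.DataFrame({"Employee": ["Alice", "Bob", "David"], "City": ["NY", "LA", "Chicago"]})

# Inner join: only names present in both frames survive.
inner = pd.merge(employees, offices, how="inner", left_on="Name", right_on="Employee")

# Left join: every row of `employees` is kept; unmatched rows get NaN for City.
left = pd.merge(employees, offices, how="left", left_on="Name", right_on="Employee")

print(inner[["Name", "City"]])
print(len(inner), len(left))           # 2 3
print(int(left["City"].isna().sum()))  # 1 (Charlie has no office)
```

With `left_on`/`right_on`, both key columns (`Name` and `Employee`) appear in the result, so one of them is usually dropped afterwards.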

---

- Concatenation: <code>pd.concat(objs, axis=0, ignore_index=False, ...)</code>

|Parameter|Description|
|---------|-----------|
|objs|A list or tuple of DataFrames to concatenate (e.g., [df1, df2])|
|axis|0 (default) for vertical (row-wise) concat, 1 for horizontal (column-wise) concat|
|ignore_index|If True, reset the index in the result (default is False)|


In [63]:
# merge two DataFrames
df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})
df2 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'David'],
    'City': ['NY', 'LA', 'Chicago']
})
df_merged = pd.merge(df1, df2, on='Name', how='outer')
df_merged


| |Name|Age|City|
|---|---|---|---|
|0|Alice|25.0|NY|
|1|Bob|30.0|LA|
|2|Charlie|35.0|NaN|
|3|David|NaN|Chicago|


In [None]:
# concatenate two DataFrames without resetting the index
df_concat = pd.concat([df1, df2], ignore_index=False)
df_concat

| |Name|Age|City|
|---|---|---|---|
|0|Alice|25.0|NaN|
|1|Bob|30.0|NaN|
|2|Charlie|35.0|NaN|
|0|Alice|NaN|NY|
|1|Bob|NaN|LA|
|2|David|NaN|Chicago|


In [72]:
# concatenate two DataFrames with resetting the index
df_concat_reset = pd.concat([df1, df2], ignore_index=True)
df_concat_reset

| |Name|Age|City|
|---|---|---|---|
|0|Alice|25.0|NaN|
|1|Bob|30.0|NaN|
|2|Charlie|35.0|NaN|
|3|Alice|NaN|NY|
|4|Bob|NaN|LA|
|5|David|NaN|Chicago|


In [73]:
# concatenate two dataframes column-wise
df_concat_col = pd.concat([df1, df2], axis=1)
df_concat_col

| |Name|Age|Name|City|
|---|---|---|---|---|
|0|Alice|25|Alice|NY|
|1|Bob|30|Bob|LA|
|2|Charlie|35|David|Chicago|


##### Apply & Lambda (one-time function)

<code>apply()</code> is a Pandas method that applies a function to:
- A Series (```df['col'].apply()```) → element-wise
- A DataFrame (```df.apply()```) → row-wise or column-wise

<code>lambda</code> is a small anonymous function defined using:

- ```lambda args: expression```

In [None]:
# apply conditional lambda function to a column
df_read["Group"] = df_read["Experience"].apply(lambda x: 'Junior' if x < 3 else 'Senior')
print("DataFrame after applying conditional lambda function:")
print(df_read[['Name','Experience','Group']])

DataFrame after applying conditional lambda function:
      Name  Experience   Group
0    Alice           2  Junior
1      Bob           5  Senior
2  Charlie           3  Senior
3    David          10  Senior
4      Eva           1  Junior
5    Frank           8  Senior
6    Grace           4  Senior


In [88]:
# apply to multiple columns
df_read['Category'] = df_read.apply(lambda x: 'Low' if x['Salary'] < 70000 and x['Experience'] > 3 else 'Normal', axis=1)
print("DataFrame after applying to multiple columns:")
print(df_read[['Name', 'Salary', 'Experience', 'Category']])

DataFrame after applying to multiple columns:
      Name   Salary  Experience Category
0    Alice  70000.0           2   Normal
1      Bob  80000.0           5   Normal
2  Charlie      0.0           3   Normal
3    David  90000.0          10   Normal
4      Eva  75000.0           1   Normal
5    Frank  60000.0           8      Low
6    Grace  50000.0           4      Low


<a id="np"></a>
### NumPy

**NumPy** (Numerical Python) is the **foundation of numerical and scientific computing in Python**. It offers:
- A powerful object called ```ndarray``` — a multi-dimensional array
- Mathematical functions for operations on arrays
- Tools for linear algebra, statistics, random number generation, and more

In [90]:
import numpy as np

<a id="ndarray"></a>
#### NumPy Array (ndarray)

In [102]:
# Create ndarray
# 1-dimensional array
arr_1d = np.array([1, 2, 3, 4, 5])
print("Original 1D ndarray:")
print(arr_1d)

# 2-dimensional array
arr_2d = np.array([[1, 2], [4, 5], [7, 8]])
print("Original 2D ndarray:")
print(arr_2d)

# 3-dimensional array
arr_3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print("Original 3D ndarray:")
print(arr_3d)



Original 1D ndarray:
[1 2 3 4 5]
Original 2D ndarray:
[[1 2]
 [4 5]
 [7 8]]
Original 3D ndarray:
[[[ 1  2  3]
  [ 4  5  6]]

 [[ 7  8  9]
  [10 11 12]]]


In [None]:
# size of the array 
# The number of elements in the array
print("Size of 1D array:", arr_1d.size)
print("Size of 2D array:", arr_2d.size)
print("Size of 3D array:", arr_3d.size)

Size of 1D array: 5
Size of 2D array: 6
Size of 3D array: 12


##### Indexing Guide

```python
X = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])  # shape = (3, 3) -> (row, column)
```

X[1, 2]  → 6

X[0:2, :] → [[1, 2, 3],
              [4, 5, 6]]

X[1, :] → [4, 5, 6]   # row at index 1 (the second row)

X[:, 2] → [3, 6, 9]   # column at index 2 (the third column)

In [123]:
X = np.array([[1, 0, 1], [2, 2, 2]])

out = X[0:2, 2]  # rows 0 and 1, column index 2
print("Output of slicing the array X:")
print(out)

Output of slicing the array X:
[1 2]
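
The indexing rules from the guide can be run directly on the 3 × 3 array. The sketch also shows a negative index, which counts from the end; the variable names are chosen for illustration.

```python
import numpy as np

X = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
])

element = X[1, 2]     # row index 1, column index 2
top_rows = X[0:2, :]  # rows 0 and 1, all columns
second_row = X[1, :]  # row index 1
third_col = X[:, 2]   # column index 2
last_row = X[-1]      # negative indices count from the end

print(element, second_row, third_col, last_row)
```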


##### Shape and Reshape Array

- .shape tells you the dimensions of the array.
- Reshape allows you to change the shape without changing the data.

In [103]:
# shape of the array
print("Shape of 1D array:", arr_1d.shape)
print("Shape of 2D array:", arr_2d.shape)
print("Shape of 3D array:", arr_3d.shape)

Shape of 1D array: (5,)
Shape of 2D array: (3, 2)
Shape of 3D array: (2, 2, 3)


In [107]:
# Reshape ndarray
arr_reshaped = arr_1d.reshape((5, 1))  # Reshape 1D to 2D
print("Reshaped 1D array to 2D:")
print(arr_reshaped)
# Reshape 2D to 3D
arr_reshaped_3d = arr_2d.reshape((3, 1, 2))  # Reshape 2D to 3D
print("Reshaped 2D array to 3D:")
print(arr_reshaped_3d)


Reshaped 1D array to 2D:
[[1]
 [2]
 [3]
 [4]
 [5]]
Reshaped 2D array to 3D:
[[[1 2]]

 [[4 5]]

 [[7 8]]]


<a id="vec"></a>
#### Vectorized Computation

Vectorized computation is one of NumPy's core strengths.

Definition: perform operations on entire arrays without writing explicit Python loops. The operation is applied element by element in compiled code, which is typically much faster than an equivalent Python loop.

Suppose you have the weights of 3 people in kg. You can convert them all to lb (1 kg ≈ 2.2 lb) with a single vectorized expression.
```python
kg = np.array([48, 65, 54])
lb = kg * 2.2
```
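
For comparison, the sketch below computes the same conversion with an explicit Python loop and with the vectorized expression, then checks that the two agree.

```python
import numpy as np

kg = np.array([48, 65, 54])

# Explicit Python loop: one multiplication per element.
lb_loop = np.array([w * 2.2 for w in kg])

# Vectorized: a single expression over the whole array.
lb_vec = kg * 2.2

print(np.allclose(lb_loop, lb_vec))  # True
print(lb_vec.dtype)                  # float64: the integer array is promoted
```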

In [None]:
# Vectorized addition
arr_add = arr_1d + 10  # Add 10 to each element
print("Array after adding 10 to each element:")
print(arr_add)

# Vectorized multiplication
arr_mul = arr_1d * arr_1d  # Element-wise multiplication
print("Array after element-wise multiplication:")
print(arr_mul)

Array after adding 10 to each element:
[11 12 13 14 15]
Array after element-wise multiplication:
[ 1  4  9 16 25]


<a id="dot"></a>
#### Dot Product (Matrix Multiplication)

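<code>np.dot()</code> computes the dot product of two arrays. For two 1-D arrays this is the inner product; for 2-D arrays it is matrix multiplication, for which the <code>@</code> operator or <code>np.matmul()</code> is generally preferred.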

##### Two Vectors (1D × 1D)


```python
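# For a = [a1, a2, a3] and b = [b1, b2, b3]:
#   dot(a, b) = a1*b1 + a2*b2 + a3*b3
a = [1, 2, 3]
b = [4, 5, 6]
dot_ab = sum(x * y for x, y in zip(a, b))  # 1*4 + 2*5 + 3*6 = 32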
```


In [111]:
# dot product - for 1D arrays
print(arr_1d)
arr_dot = np.dot(arr_1d, arr_1d)  # Dot product of 1D array with itself
print("Dot product of 1D array with itself:")
print(arr_dot)

[1 2 3 4 5]
Dot product of 1D array with itself:
55


##### Matrix × Vector (e.g. 2D × 1D)

Matrix (m×n) × Vector (n,) → Vector (m,)

Each entry of the result is the dot product of one row of the matrix with the vector.

In [None]:
# Matrix and vector multiplication
arr_matrix = np.array([[1, 2], [3, 4], [5, 6]])
arr_vector = np.array([1, 2])
arr_matrix_vector_mul = np.dot(arr_matrix, arr_vector)  # Matrix-vector multiplication
print("Matrix-vector multiplication result:")
# 1*1 + 2*2
# 3*1 + 4*2
# 5*1 + 6*2
print(arr_matrix_vector_mul)

Matrix-vector multiplication result:
[ 5 11 17]


##### Matrix × Matrix

Matrix A shape: (m × n)

Matrix B shape: (n × p)

→ Result: shape (m × p)
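
The shape rule can be checked with a short sketch. The arrays `A` and `B` are arbitrary illustrative values; the second half shows what happens when the inner dimensions do not match.

```python
import numpy as np

A = np.arange(6).reshape(2, 3)   # shape (2, 3)
B = np.arange(12).reshape(3, 4)  # shape (3, 4)

C = A @ B                        # same result as np.dot(A, B) for 2-D arrays
print(C.shape)                   # (2, 4)
print(np.array_equal(C, np.dot(A, B)))  # True

# Inner dimensions must match: (3, 4) x (2, 3) is not defined.
mismatch_raised = False
try:
    B @ A
except ValueError as err:
    mismatch_raised = True
    print("Shape mismatch:", err)
```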

Each entry of the result is the dot product of one row of A with one column of B.

In [119]:
# Matrix multiplication - for 2D arrays
print("2D array:\n", arr_2d) # 3 x 2 array
print("Transpose of 2D array:\n", arr_2d.T)  # 2 x 3 array (Transpose of the 2D array)

# Matrix multiplication of 2D array with its transpose
arr_matmul = np.dot(arr_2d, arr_2d.T) # 3 x 3 array 
print("Matrix multiplication of 2D array with its transpose:")
# [[1*1 + 2*2, 1*4 + 2*5, 1*7 + 2*8],
#  [4*1 + 5*2, 4*4 + 5*5, 4*7 + 5*8],
#  [7*1 + 8*2, 7*4 + 8*5, 7*7 + 8*8]]
print(arr_matmul)

2D array:
 [[1 2]
 [4 5]
 [7 8]]
Transpose of 2D array:
 [[1 4 7]
 [2 5 8]]
Matrix multiplication of 2D array with its transpose:
[[  5  14  23]
 [ 14  41  68]
 [ 23  68 113]]


<a id="collect"></a>
## **Data Collection**

<a id="api"></a>
### API

API (Application Programming Interface) is a set of rules that allows software applications to communicate with each other. It enables access to data or services from another application.

The Pandas and NumPy libraries are examples: each exposes its functionality through an API, a set of software components (functions and classes) that your code calls.

<a id="rest"></a>
### REST APIs

REST APIs work by sending a request over the internet. The request is carried in an HTTP message, and the response comes back in an HTTP message as well, usually with the data formatted as JSON.

In [2]:
# REST API example
import requests
# Define the API endpoint
api_url = "https://jsonplaceholder.typicode.com/posts"
# Make a GET request to the API
response = requests.get(api_url)
# Check if the request was successful
if response.status_code == 200:
    # Parse the JSON response
    data = response.json()
    print("First 2 posts from the API:")
    for post in data[:2]:  # Print first 2 posts
        print(f"Title: {post['title']}\nBody: {post['body']}\n")

First 2 posts from the API:
Title: sunt aut facere repellat provident occaecati excepturi optio reprehenderit
Body: quia et suscipit
suscipit recusandae consequuntur expedita et cum
reprehenderit molestiae ut ut quas totam
nostrum rerum est autem sunt rem eveniet architecto

Title: qui est esse
Body: est rerum tempore vitae
sequi sint nihil reprehenderit dolor beatae ea dolores neque
fugiat blanditiis voluptate porro vel nihil molestiae ut reiciendis
qui aperiam non debitis possimus qui neque nisi nulla



<a id="requests"></a>
### Requests

Requests is a Python library for making HTTP requests. It simplifies sending HTTP requests and handling responses in Python.

Common HTTP methods:
- **GET**: Retrieve data from a server.
- **POST**: Send data to the server, usually to create or update resources.
- **PUT**: Update or replace a resource on the server.
- **DELETE**: Delete a resource on the server.
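
To see what each method puts into the HTTP message, a request can be built and inspected without sending it. This is a sketch under assumptions: it reuses the placeholder endpoint from the REST example, and the `userId` query parameter is included only to show how `params` are encoded into the URL.

```python
import json
import requests

base_url = "https://jsonplaceholder.typicode.com/posts"

# Build the requests without sending them, to inspect what would go over the wire.
get_req = requests.Request("GET", base_url, params={"userId": 1}).prepare()
post_req = requests.Request("POST", base_url, json={"title": "foo", "body": "bar"}).prepare()

print(get_req.method, get_req.url)                        # params end up in the query string
print(post_req.method, post_req.headers["Content-Type"])  # json= sets the JSON content type
print(json.loads(post_req.body))                          # the body is the serialized dictionary
```

Passing the prepared request to `requests.Session().send()` would actually transmit it; the helper functions `requests.get()` and `requests.post()` do both steps at once.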

In [13]:
# Request to create a new post
new_post = {
    "title": "foo",
    "body": "bar",
    "userId": 1
}
# Make a POST request to create a new post
response_post = requests.post(api_url, json=new_post)
# Check if the POST request was successful
if response_post.status_code == 201:  # 201 Created
    created_post = response_post.json()
    print("Created Post:")
    print(f"ID: {created_post['id']}\nTitle: {created_post['title']}\nBody: {created_post['body']}")
else:
    print("Failed to create post. Status code:", response_post.status_code)

print(response_post.json()) 


Created Post:
ID: 101
Title: foo
Body: bar
{'title': 'foo', 'body': 'bar', 'userId': 1, 'id': 101}


In [8]:
# Make a PUT request to update an existing post
post_id = 1  # ID of the post to update
update_post = {
    "id": post_id,
    "title": "updated title",
    "body": "updated body",
    "userId": 1
}
response_put = requests.put(f"{api_url}/{post_id}", json=update_post)
# Check if the PUT request was successful
if response_put.status_code == 200:  # 200 OK
    updated_post = response_put.json()
    print("Updated Post:")
    print(f"ID: {updated_post['id']}\nTitle: {updated_post['title']}\nBody: {updated_post['body']}")
else:
    print("Failed to update post. Status code:", response_put.status_code)


Updated Post:
ID: 1
Title: updated title
Body: updated body


In [9]:
# Make a DELETE request to delete a post
response_delete = requests.delete(f"{api_url}/{post_id}")
# Check if the DELETE request was successful
if response_delete.status_code == 200:  # 200 OK
    print(f"Post with ID {post_id} deleted successfully.")
else:
    print(f"Failed to delete post with ID {post_id}. Status code:", response_delete.status_code)


Post with ID 1 deleted successfully.


<a id="web"></a>
### Web Scraping

In [None]:

import requests
from bs4 import BeautifulSoup

url = "https://remoteok.com/"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})

soup = BeautifulSoup(response.text, "html.parser")
for job in soup.find_all("h2"):  # find all <h2> elements, which hold the job titles
    print(job.text.strip())

find a remote job
work from anywhere
Senior Fullstack Software Engineer
Blotato
Salary and compensation
Benefits
How do you apply?


In [22]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

keyword = ["SAS", "Statistical", "Programmer", "Analyst", "Data", "Scientist", "Programming"]

url = "https://remoteok.com/remote-dev-jobs"
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")

jobs = []
for row in soup.find_all("tr", class_="job"):
    title = row.find("h2").text.strip() if row.find("h2") else None
    company = row.find("h3").text.strip() if row.find("h3") else None
    location = row.find("div", class_="location").text.strip() if row.find("div", class_="location") else "Remote"
    link = "https://remoteok.com" + row.get("data-href") if row.get("data-href") else None

    # Skip rows without a title, then keep rows whose title contains any keyword
    if title and any(kw.lower() in title.lower() for kw in keyword):
        jobs.append({
            "Title": title,
            "Company": company,
            "Location": location,
            "Link": link
        })

df = pd.DataFrame(jobs)
print(df.head())

                Title          Company          Location  \
0  Lead Data Engineer  Open Architects  🇺🇸 United States   

                                                Link  
0  https://remoteok.com/remote-jobs/remote-lead-d...  


In [30]:
url = "https://books.toscrape.com/catalogue/page-1.html"
response = requests.get(url)
# Parse the raw bytes so BeautifulSoup detects the page's UTF-8 encoding
# (response.text can mis-decode "£" as "Â£").
soup = BeautifulSoup(response.content, "html.parser")

books = []

for item in soup.find_all("article", class_="product_pod"):
    title = item.h3.a["title"]
    price = item.find("p", class_="price_color").text
    availability = item.find("p", class_="instock availability").text.strip()
    rating = item.p["class"][1]  # e.g. "Three"

    books.append({
        "Title": title,
        "Price": price,
        "Availability": availability,
        "Rating": rating
    })

df = pd.DataFrame(books)
print(df.head())

                                   Title   Price Availability Rating
0                   A Light in the Attic  £51.77     In stock  Three
1                     Tipping the Velvet  £53.74     In stock    One
2                             Soumission  £50.10     In stock    One
3                          Sharp Objects  £47.82     In stock   Four
4  Sapiens: A Brief History of Humankind  £54.23     In stock   Five
