# Data Mining - Lab - 2

# NumPy & Data Exploration with Pandas

-------------------------------------------------------------------------------
## NumPy

1) NumPy (Numerical Python) is an open-source Python library for numerical and scientific computing.
2) It provides large, multi-dimensional arrays and matrices, along with a collection of mathematical functions that operate on them efficiently.
3) Its core is largely written in C, so array operations run much faster than equivalent loops over regular Python lists.
4) It underpins many other Python libraries in data science and machine learning: pandas and scikit-learn build on it, and TensorFlow interoperates with NumPy arrays.
5) Features such as broadcasting, vectorization, and integration with C/C++ code allow cleaner and faster numerical code.
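As a minimal sketch of the vectorization and broadcasting mentioned above (the array names and values here are illustrative, not part of the lab):

```python
import numpy as np

# Vectorization: the multiplication is applied elementwise without a Python loop.
values = np.arange(5)
squared = values * values

# Broadcasting: a (3, 1) column combined with a (4,) row yields a (3, 4) grid.
column = np.array([[0], [10], [20]])
row = np.arange(4)
grid = column + row

print(squared)
print(grid)
```

Each entry of `grid` is the sum of one column value and one row value, so no explicit nested loop is needed.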



### Step 1. Import the NumPy library

In [3]:
import numpy as np



### Step 2. Create a 1D array of numbers

In [7]:
a=np.arange(11)
print(a)
print(type(a))

[ 0  1  2  3  4  5  6  7  8  9 10]
<class 'numpy.ndarray'>


In [13]:
a=np.arange(2,10)
a

array([2, 3, 4, 5, 6, 7, 8, 9])

### Step 3. Reshape 1D to 2D Array

In [21]:
a=np.arange(20).reshape(4,5)
a

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])
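A short sketch of two details of `reshape` that the step above relies on: the new shape must hold exactly the same number of elements, and one dimension may be given as `-1` so that NumPy infers it. The variable names are illustrative.

```python
import numpy as np

flat = np.arange(20)

# -1 asks NumPy to infer the missing dimension (here 20 / 4 = 5).
grid = flat.reshape(4, -1)
print(grid.shape)

# A shape whose element count differs from 20 raises ValueError.
try:
    flat.reshape(3, 7)
    reshape_failed = False
except ValueError as exc:
    reshape_failed = True
    print("reshape failed:", exc)
```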

### Step 4. Create a Linspace array

In [29]:
np.linspace(1,8,15)

array([1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. , 5.5, 6. , 6.5, 7. ,
       7.5, 8. ])
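The spacing in the output above follows from `(stop - start) / (num - 1)`, because `linspace` includes both endpoints by default. The sketch below checks this with the `retstep` option and compares the result against an `arange` call that produces the same grid.

```python
import numpy as np

# retstep=True also returns the spacing between consecutive samples.
points, step = np.linspace(1, 8, 15, retstep=True)
print(step)

# arange excludes its stop value, so the stop must lie past 8 to include it.
same_grid = np.arange(1, 8.5, 0.5)
print(np.allclose(points, same_grid))
```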

### Step 5. Create a Random Numbered Array

In [33]:
np.random.rand(4)

array([0.9854443 , 0.69467579, 0.67278284, 0.16217519])

In [35]:
np.random.rand(3,5)

array([[0.57246425, 0.22183005, 0.52415261, 0.09343574, 0.75516053],
       [0.72191723, 0.70362422, 0.91003873, 0.13085668, 0.44680972],
       [0.87936433, 0.25003088, 0.22225081, 0.46951573, 0.19178872]])

### Step 6. Create a Random Integer Array

In [39]:
np.random.randint(20,90,size=5)

array([65, 58, 65, 81, 41])

In [45]:
np.random.randint(20,90,size=(4,2))

array([[29, 74],
       [68, 29],
       [37, 83],
       [29, 24]])
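The random outputs in Steps 5 and 6 change on every run. If repeatable results are wanted, one option is NumPy's `Generator` API with a fixed seed. This is a sketch of an alternative, not what the lab cells use, and the seed value 42 is an arbitrary choice.

```python
import numpy as np

# A seeded generator produces the same sequence on every run.
rng = np.random.default_rng(seed=42)

floats = rng.random(4)                      # uniform floats in [0, 1)
ints = rng.integers(20, 90, size=(4, 2))    # integers in [20, 90), upper bound excluded

print(floats)
print(ints)
```

As with `np.random.randint`, the upper bound of `rng.integers` is exclusive by default, so 90 never appears.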

### Step 7. Create a 1D Array and get Max, Min, ArgMax, ArgMin

In [47]:
arr=np.random.randint(20,90,size=5)
arr

array([76, 42, 28, 38, 82])

In [49]:
arr.max()

82

In [51]:
arr.min()

28

In [53]:
arr.argmax()

4

In [55]:
arr.argmin()

2
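`argmax` and `argmin` return positions, not values, so indexing with them recovers the maximum and minimum. The sketch below rebuilds the array shown in Step 7 and also shows the `axis` argument on a small 2D array whose values are made up for illustration.

```python
import numpy as np

arr = np.array([76, 42, 28, 38, 82])    # same values as the Step 7 output

# Indexing with the position returned by argmax/argmin gives max/min back.
largest = arr[arr.argmax()]
smallest = arr[arr.argmin()]

# On a 2D array, axis=0 reduces down the columns and axis=1 across the rows.
grid = np.array([[3, 9, 1],
                 [7, 2, 8]])
column_max = grid.max(axis=0)
row_argmax = grid.argmax(axis=1)

print(largest, smallest)
print(column_max, row_argmax)
```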

### Step 8. Indexing in 1D Array

In [103]:
arr[4]

82

In [105]:
arr[1:4]

array([42, 28, 38])

### Step 9. Indexing in 2D Array

In [77]:
arr2d=np.array([[101,102,103,104,105],[201,202,203,204,205],[301,302,303,304,305],[401,402,403,404,405]])
arr2d

array([[101, 102, 103, 104, 105],
       [201, 202, 203, 204, 205],
       [301, 302, 303, 304, 305],
       [401, 402, 403, 404, 405]])

In [79]:
arr2d[1::2]

array([[201, 202, 203, 204, 205],
       [401, 402, 403, 404, 405]])

In [81]:
arr2d[::2,::2]

array([[101, 103, 105],
       [301, 303, 305]])

In [83]:
arr2d[1:3, 1:4]

array([[202, 203, 204],
       [302, 303, 304]])
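One point worth knowing about the slices above: basic slicing returns a view onto the original data, not a copy. The following sketch, which rebuilds `arr2d` so it runs on its own, shows the difference between a view and an explicit `.copy()`.

```python
import numpy as np

arr2d = np.array([[101, 102, 103, 104, 105],
                  [201, 202, 203, 204, 205],
                  [301, 302, 303, 304, 305],
                  [401, 402, 403, 404, 405]])

block = arr2d[1:3, 1:4]               # view: shares memory with arr2d
block_copy = arr2d[1:3, 1:4].copy()   # independent copy

block[0, 0] = -1                      # writes through to arr2d[1, 1]

print(arr2d[1, 1])        # changed by the write to the view
print(block_copy[0, 0])   # the copy keeps the original value
```

Calling `.copy()` is the usual way to avoid modifying the source array by accident.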

### Step 10. Conditional Selection

In [115]:
arr2d[arr2d>4]

array([101, 102, 103, 104, 105, 201, 202, 203, 204, 205, 301, 302, 303,
       304, 305, 401, 402, 403, 404, 405])

In [89]:
arr2d[arr2d>6]

array([101, 102, 103, 104, 105, 201, 202, 203, 204, 205, 301, 302, 303,
       304, 305, 401, 402, 403, 404, 405])
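Because every element of `arr2d` is greater than 100, the thresholds 4 and 6 above select the whole array. A sketch with thresholds inside the data range shows the filtering effect more clearly; the thresholds 300 and 200 are chosen here for illustration only.

```python
import numpy as np

arr2d = np.array([[101, 102, 103, 104, 105],
                  [201, 202, 203, 204, 205],
                  [301, 302, 303, 304, 305],
                  [401, 402, 403, 404, 405]])

# The comparison builds a boolean mask with the same shape as arr2d.
mask = arr2d > 300
above_300 = arr2d[mask]

# Combine conditions with & (and) or | (or); each side needs parentheses.
even_above_200 = arr2d[(arr2d > 200) & (arr2d % 2 == 0)]

print(above_300)
print(even_above_200)
```

Selecting with a boolean mask always returns a flattened 1D array of the matching elements.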

### 🔥You did it! 10 exercises down — you're on fire! 🔥

## Pandas



### Step 1. Import the necessary libraries

In [95]:
import pandas as pd

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user). 

In [121]:
users=pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user')
users

,user_id|age|gender|occupation|zip_code
0,1|24|M|technician|85711
1,2|53|F|other|94043
2,3|23|M|writer|32067
3,4|24|M|technician|43537
4,5|33|F|other|15213
...,...
938,939|26|F|student|33319
939,940|32|M|administrator|02215
940,941|20|M|student|97229
941,942|48|F|librarian|78209


### Step 3. Assign it to a variable called users and use the 'user_id' as index

In [178]:
users=pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user',sep="|",index_col='user_id')
users

user_id,age,gender,occupation,zip_code
1,24,M,technician,85711
2,53,F,other,94043
3,23,M,writer,32067
4,24,M,technician,43537
5,33,F,other,15213
...,...,...,...,...
939,26,F,student,33319
940,32,M,administrator,02215
941,20,M,student,97229
942,48,F,librarian,78209
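The difference between Steps 2 and 3 comes from the `sep` argument: without it, `read_csv` splits on commas and each pipe-delimited line ends up in a single column. The sketch below reproduces the Step 3 call offline on four rows copied from the output above, using `io.StringIO` in place of the URL. Passing `dtype={"zip_code": str}` is an assumption added here so that the leading zero in `02215` survives in this small numeric-only sample.

```python
import io
import pandas as pd

# Four rows taken from the output above; the real file has 943 rows.
raw = """user_id|age|gender|occupation|zip_code
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
940|32|M|administrator|02215
"""

users_sample = pd.read_csv(
    io.StringIO(raw),
    sep="|",                    # the file is pipe-delimited, not comma-delimited
    index_col="user_id",        # use user_id as the row index
    dtype={"zip_code": str},    # keep ZIP codes as text so leading zeros are preserved
)

print(users_sample)
```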


### Step 4. See the first 25 entries

In [130]:
users.head(25)

user_id,age,gender,occupation,zip_code
1,24,M,technician,85711
2,53,F,other,94043
3,23,M,writer,32067
4,24,M,technician,43537
5,33,F,other,15213
6,42,M,executive,98101
7,57,M,administrator,91344
8,36,M,administrator,05201
9,29,M,student,01002
10,53,M,lawyer,90703
...,...,...,...,...


### Step 5. See the last 10 entries

In [132]:
users.tail(10)

user_id,age,gender,occupation,zip_code
934,61,M,engineer,22902
935,42,M,doctor,66221
936,24,M,other,32789
937,48,M,educator,98072
938,38,F,technician,55038
939,26,F,student,33319
940,32,M,administrator,02215
941,20,M,student,97229
942,48,F,librarian,78209
943,22,M,student,77841


### Step 6. What is the number of observations in the dataset?

In [154]:
users.shape[0]

943

### Step 7. What is the number of columns in the dataset?

In [187]:
users.shape[1]

4

### Step 8. Print the name of all the columns.

In [185]:
users.columns

Index(['age', 'gender', 'occupation', 'zip_code'], dtype='object')

### Step 9. How is the dataset indexed?

In [183]:
users.index

Index([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,
       ...
       934, 935, 936, 937, 938, 939, 940, 941, 942, 943],
      dtype='int64', name='user_id', length=943)

### Step 10. What is the data type of each column?

In [181]:
users.dtypes

age            int64
gender        object
occupation    object
zip_code      object
dtype: object

### Step 11. Print only the occupation column

In [189]:
users['occupation']

user_id
1         technician
2              other
3             writer
4         technician
5              other
           ...      
939          student
940    administrator
941          student
942        librarian
943          student
Name: occupation, Length: 943, dtype: object

### Step 12. How many different occupations are in this dataset?

In [193]:
users.occupation.nunique()

21

### Step 13. What is the most frequent occupation?

In [201]:
users.occupation.value_counts().head(1)

occupation
student    196
Name: count, dtype: int64
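`value_counts()` sorts by frequency in descending order, which is why `.head(1)` returns the most common value. If only the label is needed, `idxmax()` or `mode()` are alternatives. The sketch below uses a small made-up Series rather than the real dataset.

```python
import pandas as pd

# Hypothetical sample, not the real occupation column.
occupations = pd.Series(
    ["student", "other", "student", "writer", "other", "student"],
    name="occupation",
)

counts = occupations.value_counts()

most_common = counts.idxmax()       # label with the highest count
most_common_count = counts.max()    # the count itself
modes = occupations.mode().tolist() # every value tied for most frequent

print(most_common, most_common_count, modes)
```

`mode()` returns all tied values, which matters when two categories share the top count.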

### Step 14. Summarize the DataFrame.

In [203]:
users.describe()

Unnamed: 0,age
count,943.0
mean,34.051962
std,12.19274
min,7.0
25%,25.0
50%,31.0
75%,43.0
max,73.0


### Step 15. Summarize all the columns

In [205]:
users.describe(include='all')

,age,gender,occupation,zip_code
count,943.0,943,943,943
unique,,2,21,795
top,,M,student,55414
freq,,670,196,9
mean,34.051962,,,
std,12.19274,,,
min,7.0,,,
25%,25.0,,,
50%,31.0,,,
75%,43.0,,,
max,73.0,,,


### Step 16. Summarize only the occupation column

In [17]:
users=pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user',sep="|")
print(users['occupation'].value_counts())

occupation
student          196
other            105
educator          95
administrator     79
engineer          67
programmer        66
librarian         51
writer            45
executive         32
scientist         31
artist            28
technician        27
marketing         26
entertainment     18
healthcare        16
retired           14
lawyer            12
salesman          12
none               9
homemaker          7
doctor             7
Name: count, dtype: int64
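The cell above summarizes the column with a full frequency table. A more compact alternative is `describe()` on the column itself, which for a text column reports the count, the number of unique values, the most frequent value, and its frequency. The sketch below uses a small made-up Series for illustration.

```python
import pandas as pd

# Hypothetical sample, not the real occupation column.
occupations = pd.Series(
    ["student", "other", "student", "writer", "other", "student"],
    name="occupation",
)

# For object (text) data, describe() returns count, unique, top and freq.
summary = occupations.describe()
print(summary)
```

On the lab's DataFrame the equivalent call would be `users['occupation'].describe()`.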


### Step 17. What is the mean age of users?

In [None]:
users.age.mean()

### Step 18. What is the age with least occurrence?

In [197]:
users.age.value_counts().tail()

age
7     1
66    1
11    1
10    1
73    1
Name: count, dtype: int64
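`tail()` shows the last five entries of the frequency table, but more or fewer ages may be tied at the minimum count. One way to list every age tied for the lowest frequency is to filter on the minimum count, sketched here on a small made-up Series.

```python
import pandas as pd

# Hypothetical sample, not the real age column.
ages = pd.Series([24, 53, 23, 24, 33, 53, 24], name="age")

counts = ages.value_counts()

# Keep only the ages whose count equals the smallest count.
rarest_ages = sorted(counts[counts == counts.min()].index.tolist())

print(rarest_ages)
```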

### You're not just learning, you're mastering it. Keep aiming higher! 🚀