# Data Operations using R/Python

## Operations to Perform:
1. **Read data from different formats** (like CSV, XLS)
2. **Find Shape of Data**
3. **Find Missing Values**
4. **Find Data Type of Each Column**
5. **Find Zeros**
6. **Indexing, Selecting, and Sorting Data**
7. **Describe Attributes of Data**:
   - Checking data types of each column
8. **Counting Unique Values of Data**
9. **Format of Each Column**
10. **Converting Variable Data Type** (e.g., from long to short, and vice versa)


In [1]:
import pandas as pd

In [3]:
# a) read data from different formats (like csv, xls)
df = pd.read_csv("Customers - Customers.csv")
df.head()

,CustomerID,Gender,Age,Annual Income ($),Spending Score (1-100),Profession,Work Experience,Family Size
0,1,Male,19,15000,39,Healthcare,1,4
1,2,Male,21,35000,81,Engineer,3,3
2,3,Female,20,86000,6,Engineer,1,1
3,4,Female,23,59000,77,Lawyer,0,2
4,5,Female,31,38000,40,Entertainment,2,6
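The cell above reads only a CSV file. A minimal sketch of handling more than one format follows, assuming a hypothetical helper `read_table` (not part of pandas) that picks a reader from the file extension. `pd.read_excel` also needs an Excel engine such as openpyxl installed, so the demo below exercises only the CSV branch, using a small made-up file.

```python
import os
import tempfile

import pandas as pd


def read_table(path):
    """Hypothetical helper: choose a pandas reader from the file extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".csv":
        return pd.read_csv(path)
    if ext in (".xls", ".xlsx"):
        # Needs an Excel engine to be installed (for example openpyxl for .xlsx).
        return pd.read_excel(path)
    raise ValueError(f"Unsupported file extension: {ext!r}")


# Self-contained demo: write a tiny made-up CSV to a temp folder and read it back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    pd.DataFrame({"CustomerID": [1, 2], "Age": [19, 21]}).to_csv(demo_path, index=False)
    demo = read_table(demo_path)

print(demo.shape)
```

Writing with `index=False` keeps the row index out of the file, which avoids an extra `Unnamed: 0` column when the file is read back.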


In [4]:
# b) Find Shape of Data
df.shape

(2000, 8)

In [5]:
# c) Find Missing Values
df.isnull().sum()

CustomerID                 0
Gender                     0
Age                        0
Annual Income ($)          0
Spending Score (1-100)     0
Profession                35
Work Experience            0
Family Size                0
dtype: int64
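Beyond the per-column counts above, it is often useful to see the share of missing values and the affected rows. The following is a sketch on a small made-up frame (`toy` and its values are illustrative, not taken from the customers file).

```python
import numpy as np
import pandas as pd

# Made-up data: two of the four Profession entries are missing.
toy = pd.DataFrame({
    "Age": [19, 21, 20, 23],
    "Profession": ["Healthcare", np.nan, "Engineer", np.nan],
})

# Number of missing values in each column.
missing_count = toy.isnull().sum()

# Share of missing values in each column, as a percentage.
missing_pct = toy.isnull().mean() * 100

# Rows that have at least one missing value.
rows_with_missing = toy[toy.isnull().any(axis=1)]

print(missing_count)
print(missing_pct)
print(rows_with_missing)
```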

In [7]:
# d) Find data type of each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   CustomerID              2000 non-null   int64 
 1   Gender                  2000 non-null   object
 2   Age                     2000 non-null   int64 
 3   Annual Income ($)       2000 non-null   int64 
 4   Spending Score (1-100)  2000 non-null   int64 
 5   Profession              1965 non-null   object
 6   Work Experience         2000 non-null   int64 
 7   Family Size             2000 non-null   int64 
dtypes: int64(6), object(2)
memory usage: 125.1+ KB


In [None]:
# e) Finding out zeros
# isnull() only detects missing values; compare against 0 to count zeros per column
(df == 0).sum()
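Zeros and missing values are different things: a zero is a recorded value, while a missing value is the absence of one. A small sketch on made-up data (the `toy` frame is an assumption for illustration) shows how the two counts differ.

```python
import numpy as np
import pandas as pd

# Made-up data containing both zeros and missing values.
toy = pd.DataFrame({
    "Age": [0, 21, 20, 0],
    "Work Experience": [1, 0, np.nan, 3],
    "Profession": ["Artist", "Doctor", None, "Lawyer"],
})

# Count zeros in the numeric columns only; NaN does not compare equal to 0.
numeric = toy.select_dtypes(include="number")
zero_counts = (numeric == 0).sum()

# Count missing values in every column.
missing_counts = toy.isnull().sum()

print(zero_counts)
print(missing_counts)
```

Restricting the comparison to numeric columns is a design choice here: it keeps text columns out of the zero count, where a comparison with `0` carries no meaning.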

In [8]:
# f) Indexing and selecting data, sort data

# a) Indexing: select the fifth row by integer position
df.iloc[4]

CustomerID                            5
Gender                           Female
Age                                  31
Annual Income ($)                 38000
Spending Score (1-100)               40
Profession                Entertainment
Work Experience                       2
Family Size                           6
Name: 4, dtype: object

In [9]:
# b) Selecting: keep only a subset of columns
df[["Age","Gender","Profession"]]

,Age,Gender,Profession
0,19,Male,Healthcare
1,21,Male,Engineer
2,20,Female,Engineer
3,23,Female,Lawyer
4,31,Female,Entertainment
...,...,...,...
1995,71,Female,Artist
1996,91,Female,Doctor
1997,87,Male,Healthcare
1998,77,Male,Executive


In [10]:
# c) Sorting: order rows by Age, oldest first
df.sort_values(by="Age", ascending=False)

,CustomerID,Gender,Age,Annual Income ($),Spending Score (1-100),Profession,Work Experience,Family Size
1103,1104,Female,99,103706,50,Entertainment,1,2
1524,1525,Female,99,150782,18,Executive,8,2
1629,1630,Male,99,162762,52,Healthcare,1,1
361,362,Male,99,63364,61,Entertainment,1,2
1322,1323,Female,99,144176,74,Marketing,7,2
...,...,...,...,...,...,...,...,...
1271,1272,Female,0,61228,81,Entertainment,1,6
559,560,Male,0,151298,89,Artist,0,6
1583,1584,Female,0,120899,7,Marketing,2,6
211,212,Female,0,22000,92,Artist,2,1
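The indexing, selecting, and sorting cells above use `iloc`, column lists, and a single sort key. The sketch below, on a made-up frame with a non-default index, contrasts position-based and label-based indexing, adds a boolean filter, and sorts on two keys. All names and values are illustrative assumptions.

```python
import pandas as pd

# Made-up data with index labels that differ from row positions.
toy = pd.DataFrame(
    {
        "Age": [31, 19, 23, 19],
        "Gender": ["Female", "Male", "Female", "Male"],
        "Family Size": [6, 4, 2, 3],
    },
    index=[10, 11, 12, 13],
)

by_position = toy.iloc[0]   # first row, regardless of its label
by_label = toy.loc[12]      # row whose index label is 12

# Boolean selection: keep rows where Age is greater than 20.
older_than_20 = toy[toy["Age"] > 20]

# Sort by Age ascending, breaking ties by Family Size descending.
ordered = toy.sort_values(by=["Age", "Family Size"], ascending=[True, False])

print(ordered)
```

`sort_values` returns a new frame and leaves `toy` unchanged, so the result has to be assigned if it is needed later.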


In [11]:
# g) Describe attributes of data, checking data types of each column
df.describe()

,CustomerID,Age,Annual Income ($),Spending Score (1-100),Work Experience,Family Size
count,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0
mean,1000.5,48.96,110731.8215,50.9625,4.1025,3.7685
std,577.494589,28.429747,45739.536688,27.934661,3.922204,1.970749
min,1.0,0.0,0.0,0.0,0.0,1.0
25%,500.75,25.0,74572.0,28.0,1.0,2.0
50%,1000.5,48.0,110045.0,50.0,3.0,4.0
75%,1500.25,73.0,149092.75,75.0,7.0,5.0
max,2000.0,99.0,189974.0,100.0,17.0,9.0


In [16]:
# g) continued: check the data type of each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   CustomerID              2000 non-null   int64 
 1   Gender                  2000 non-null   object
 2   Age                     2000 non-null   int64 
 3   Annual Income ($)       2000 non-null   int64 
 4   Spending Score (1-100)  2000 non-null   int64 
 5   Profession              1965 non-null   object
 6   Work Experience         2000 non-null   int64 
 7   Family Size             2000 non-null   int64 
dtypes: int64(6), object(2)
memory usage: 125.1+ KB
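The task list also asks for the format of each column and for converting data types (from long to short, and vice versa), which the notebook does not show. One possible sketch, on made-up data, uses `dtypes`, `astype`, and `pd.to_numeric`. The target types chosen here are assumptions for illustration, not requirements of the dataset.

```python
import pandas as pd

# Made-up data; integer columns built from Python ints default to int64.
toy = pd.DataFrame({
    "Age": [19, 21, 20],
    "Annual Income ($)": [15000, 35000, 86000],
    "Gender": ["Male", "Male", "Female"],
})

# Format (dtype) of each column.
print(toy.dtypes)

# Long to short: int16 holds values up to 32767, so incomes need int32.
converted = toy.astype(
    {"Age": "int16", "Annual Income ($)": "int32", "Gender": "category"}
)

# Short to long: widen Age back to int64.
widened = converted.astype({"Age": "int64"})

# Let pandas pick the smallest integer type that fits the values.
auto = pd.to_numeric(toy["Age"], downcast="integer")

print(converted.dtypes)
```

Narrowing a column is only safe when every value fits in the target type, so it helps to check the column's minimum and maximum first.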


In [14]:
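# h) Counting unique values of data
# DataFrame.value_counts() counts unique rows and, by default, drops rows that contain NaN
df.value_counts()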

CustomerID  Gender  Age  Annual Income ($)  Spending Score (1-100)  Profession     Work Experience  Family Size
1           Male    19   15000              39                      Healthcare     1                4              1
1346        Male    36   75178              41                      Entertainment  1                7              1
1344        Female  82   157994             92                      Artist         1                1              1
1343        Female  45   114741             96                      Artist         0                5              1
1342        Male    38   76103              62                      Executive      9                3              1
                                                                                                                  ..
662         Male    53   148759             52                      Doctor         7                7              1
661         Male    49   81084              48                      E

In [15]:
# Total rows counted: 2000 rows minus the 35 rows with a missing Profession
df.value_counts().sum()

1965
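`DataFrame.value_counts()` counts whole rows, so it does not report the unique values within each column. A sketch of per-column counting on made-up data (the `toy` frame is illustrative) uses `nunique()` and `Series.value_counts()`.

```python
import pandas as pd

# Made-up data with one missing Profession.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female"],
    "Profession": ["Engineer", None, "Engineer", "Lawyer"],
})

# Number of distinct non-missing values in each column.
unique_per_column = toy.nunique()

# Frequency of each value in one column, keeping missing values as their own group.
profession_counts = toy["Profession"].value_counts(dropna=False)

# Row-level counting skips rows with missing values, so the total can be below len(toy).
complete_rows = toy.value_counts().sum()

print(unique_per_column)
print(profession_counts)
print(complete_rows, "of", len(toy), "rows have no missing values")
```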