# Preprocessing and DataFrames

## Preprocessing
Preprocessing is the process of cleaning and organizing raw data before using it for analysis or machine learning. The main goal is to improve the quality of the data so it can be used effectively for modeling and analysis. Common preprocessing steps include:

1. **Handling Missing Data**: Filling in or removing missing values.
2. **Data Normalization/Standardization**: Scaling features to have consistent ranges.
3. **Data Cleaning**: Removing duplicates, correcting errors, or addressing outliers.
4. **Encoding Categorical Variables**: Converting categories into numerical form (e.g., one-hot encoding or label encoding).
5. **Feature Selection/Engineering**: Selecting existing features or creating new ones that improve model performance.
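
The steps listed above can be sketched on a tiny table. This is a minimal illustration, not a full pipeline: the `raw` data, its column names, and the choice of median filling and min-max scaling are all assumptions made for the example. Feature selection and engineering are left out because they depend on the modeling task.

```python
import pandas as pd

# Hypothetical raw data with one missing value, one duplicate row,
# and one categorical column (all values invented for illustration).
raw = pd.DataFrame({
    "age": [25.0, None, 40.0, 40.0],
    "income": [30000.0, 45000.0, 80000.0, 80000.0],
    "city": ["Lahore", "Karachi", "Lahore", "Lahore"],
})

# Handling missing data: fill the missing age with the column median.
clean = raw.copy()
clean["age"] = clean["age"].fillna(clean["age"].median())

# Data cleaning: drop exact duplicate rows.
clean = clean.drop_duplicates().reset_index(drop=True)

# Normalization: min-max scale each numeric column to the range [0, 1].
for col in ["age", "income"]:
    col_min, col_max = clean[col].min(), clean[col].max()
    clean[col] = (clean[col] - col_min) / (col_max - col_min)

# Encoding categorical variables: one-hot encode the 'city' column.
encoded = pd.get_dummies(clean, columns=["city"])
print(encoded)
```

The order matters in this sketch: missing values are filled before scaling so that the minimum and maximum are computed on complete columns.
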

### DataFrames
DataFrames are two-dimensional, labeled data structures commonly used in data analysis. They resemble tables or spreadsheets and are particularly well suited to structured data. Each column in a DataFrame can hold a different data type (e.g., integers, floats, strings).

In Python, DataFrames are primarily used with the **pandas** library. Key characteristics of DataFrames include:

- They consist of rows and columns, where each column represents a feature or variable, and each row represents an observation or record.
- DataFrames provide powerful operations to filter, group, merge, and aggregate data efficiently.
- They can be created from various sources, such as CSV files, databases, or raw data stored in lists or dictionaries.

Below is an example of creating and using a DataFrame with pandas:

```python
import pandas as pd

# Creating a DataFrame
data = {
    'Name': ['Hamna', 'Fizza', 'Yumna'],
    'Age': [19, 20, 21],
    'City': ['Islamabad', 'Faisalabad', 'Multan']
}

df = pd.DataFrame(data)

# Display the DataFrame
print(df)

# Access a column
print(df['Age'])

# Filter rows where Age is greater than 20
filtered_df = df[df['Age'] > 20]
print(filtered_df)
```
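
The example above creates and filters a DataFrame. The other operations mentioned earlier (merging, grouping, and aggregating) might look like the following sketch. The `cities` lookup table, the `Province` column, and the extra row are assumptions added only for illustration.

```python
import pandas as pd

# Illustrative data; the fourth person and the lookup table are invented.
people = pd.DataFrame({
    "Name": ["Hamna", "Fizza", "Yumna", "Ayesha"],
    "Age": [19, 20, 21, 22],
    "City": ["Islamabad", "Faisalabad", "Multan", "Islamabad"],
})
cities = pd.DataFrame({
    "City": ["Islamabad", "Faisalabad", "Multan"],
    "Province": ["ICT", "Punjab", "Punjab"],
})

# Merge: attach the province to each person by matching on 'City'.
merged = people.merge(cities, on="City", how="left")

# Group and aggregate: mean age and number of people per province.
summary = merged.groupby("Province")["Age"].agg(["mean", "count"])
print(summary)
```
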


## Read a CSV File into a pandas DataFrame

```python
import pandas as pd

df = pd.read_csv('data.csv')
```
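
Because `data.csv` is not included here, the following sketch reads the same kind of data from an in-memory string instead. The three rows mirror the start of the dataset shown in this document, but the exact layout of the real file is an assumption, and `sample_df` is a hypothetical name.

```python
import io

import pandas as pd

# Stand-in for the file contents; only three columns are reproduced.
csv_text = """Month,Python Worldwide(%),JavaScript Worldwide(%)
2004-01,30,98
2004-02,29,98
2004-03,28,100
"""

# read_csv accepts any file-like object, so StringIO works like a file path.
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df)
```
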

## Show the First 5 Rows of the Dataset

```python
df.head()
```

|   | Month | Python Worldwide(%) | JavaScript Worldwide(%) | Java Worldwide(%) | C# Worldwide(%) | PhP Worldwide(%) | Flutter Worldwide(%) | React Worldwide(%) | Swift Worldwide(%) | TypeScript Worldwide(%) | Matlab Worldwide(%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2004-01 | 30 | 98 | 96 | 76 | 100 | 6 | 1 | 9 | 2 | 78 |
| 1 | 2004-02 | 29 | 98 | 97 | 86 | 99 | 6 | 2 | 9 | 1 | 91 |
| 2 | 2004-03 | 28 | 100 | 100 | 87 | 97 | 5 | 2 | 9 | 2 | 99 |
| 3 | 2004-04 | 28 | 98 | 97 | 89 | 100 | 6 | 1 | 9 | 2 | 95 |
| 4 | 2004-05 | 28 | 91 | 99 | 84 | 92 | 6 | 2 | 10 | 3 | 86 |


## Show the Number of Rows and Columns

```python
df.shape
```

```text
(249, 11)
```
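
As a small sketch of how these inspection calls fit together, the example below uses a hypothetical eight-row frame (the `Value` column is invented) to show that `shape` is a `(rows, columns)` tuple and that `head` and `tail` accept a row count.

```python
import pandas as pd

# Hypothetical eight-row frame standing in for the real dataset.
frame = pd.DataFrame({
    "Month": [f"2004-{m:02d}" for m in range(1, 9)],
    "Value": [30, 29, 28, 28, 28, 27, 27, 26],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = frame.shape
print(n_rows, n_cols)  # 8 2

first_three = frame.head(3)  # first 3 rows instead of the default 5
last_two = frame.tail(2)     # last 2 rows
print(first_three)
print(last_two)
```
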

## Information About the Dataset

```python
df.info()
```

```text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 249 entries, 0 to 248
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   Month                    249 non-null    object
 1   Python Worldwide(%)      249 non-null    int64
 2   JavaScript Worldwide(%)  249 non-null    int64
 3   Java Worldwide(%)        249 non-null    int64
 4   C# Worldwide(%)          249 non-null    int64
 5   PhP Worldwide(%)         249 non-null    int64
 6   Flutter Worldwide(%)     249 non-null    int64
 7   React Worldwide(%)       249 non-null    int64
 8   Swift Worldwide(%)       249 non-null    int64
 9   TypeScript Worldwide(%)  249 non-null    int64
 10  Matlab Worldwide(%)      249 non-null    int64
dtypes: int64(10), object(1)
memory usage: 21.5+ KB
```


## Show Null Values

```python
df.isnull()
```

|   | Month | Python Worldwide(%) | JavaScript Worldwide(%) | Java Worldwide(%) | C# Worldwide(%) | PhP Worldwide(%) | Flutter Worldwide(%) | React Worldwide(%) | Swift Worldwide(%) | TypeScript Worldwide(%) | Matlab Worldwide(%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 244 | False | False | False | False | False | False | False | False | False | False | False |
| 245 | False | False | False | False | False | False | False | False | False | False | False |
| 246 | False | False | False | False | False | False | False | False | False | False | False |
| 247 | False | False | False | False | False | False | False | False | False | False | False |


## Count Null Values by Chaining Two Functions

```python
df.isnull().sum()
```

```text
Month                      0
Python Worldwide(%)        0
JavaScript Worldwide(%)    0
Java Worldwide(%)          0
C# Worldwide(%)            0
PhP Worldwide(%)           0
Flutter Worldwide(%)       0
React Worldwide(%)         0
Swift Worldwide(%)         0
TypeScript Worldwide(%)    0
Matlab Worldwide(%)        0
dtype: int64
```
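
The dataset above has no missing values, so the counts are all zero. To show what the same calls report when data is missing, here is a sketch on a hypothetical frame (`scores` and its values are invented) that also locates the affected rows.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps, invented to make the counts non-zero.
scores = pd.DataFrame({
    "Month": ["2004-01", "2004-02", "2004-03", "2004-04"],
    "Python": [30.0, np.nan, 28.0, 28.0],
    "Java": [96.0, 97.0, np.nan, np.nan],
})

# isnull() gives a True/False mask; sum() counts the True values per column.
missing_per_column = scores.isnull().sum()
print(missing_per_column)

# mean() of the same mask gives the fraction of missing values per column.
missing_fraction = scores.isnull().mean()

# any(axis=1) flags rows that contain at least one missing value.
rows_with_missing = scores[scores.isnull().any(axis=1)]

# dropna() keeps only the rows that are fully populated.
complete_rows = scores.dropna()
```
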

## Convert Object Types to Categories

```python
df['Month'] = df['Month'].astype('category')
```
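
To see what the category conversion does internally, the sketch below applies it to a short hypothetical series. It also shows a datetime conversion as an alternative for month strings; that alternative is a suggestion made here, not something the original notebook does.

```python
import pandas as pd

# Hypothetical month labels with one repeated value.
months = pd.Series(["2004-01", "2004-02", "2004-01", "2004-03"], name="Month")

# A categorical stores each distinct label once and an integer code per row.
as_category = months.astype("category")
print(as_category.cat.categories.tolist())  # ['2004-01', '2004-02', '2004-03']
print(as_category.cat.codes.tolist())       # [0, 1, 0, 2]

# Alternative (assumption): parse the strings as dates to enable date arithmetic.
as_datetime = pd.to_datetime(months, format="%Y-%m")
print(as_datetime.dt.month.tolist())        # [1, 2, 1, 3]
```

Categories save the most memory when labels repeat often. In a column where every month appears once, a datetime type may be the more useful choice.
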

## Check Data Types

```python
print(df.dtypes)
```

```text
Month                      category
Python Worldwide(%)           int64
JavaScript Worldwide(%)       int64
Java Worldwide(%)             int64
C# Worldwide(%)               int64
PhP Worldwide(%)              int64
Flutter Worldwide(%)          int64
React Worldwide(%)            int64
Swift Worldwide(%)            int64
TypeScript Worldwide(%)       int64
Matlab Worldwide(%)           int64
dtype: object
```


## Describe Data (With Transpose)

```python
print(df.describe().T)
```

```text
                         count       mean        std   min   25%   50%   75%  \
Python Worldwide(%)      249.0  41.678715  23.103231  20.0  23.0  29.0  60.0
JavaScript Worldwide(%)  249.0  43.963855  16.024512  24.0  34.0  37.0  48.0
Java Worldwide(%)        249.0  35.995984  21.897948  11.0  17.0  34.0  45.0
C# Worldwide(%)          249.0  59.626506  20.330958  27.0  45.0  57.0  78.0
PhP Worldwide(%)         249.0  37.477912  22.997381  13.0  20.0  27.0  48.0
Flutter Worldwide(%)     249.0  22.642570  28.368911   4.0   6.0   8.0  32.0
React Worldwide(%)       249.0  25.883534  32.518210   1.0   1.0   2.0  50.0
Swift Worldwide(%)       249.0  30.678715  15.328697   8.0  23.0  29.0  37.0
TypeScript Worldwide(%)  249.0  23.405622  28.603098   1.0   3.0   4.0  38.0
Matlab Worldwide(%)      249.0  59.289157  13.282170  39.0  50.0  58.0  66.0

                           max
Python Worldwide(%)      100.0
JavaScript Worldwide(%)  100.0
Java Worldwide(%)
```
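
The effect of `.T` is easier to see on a small frame. In this sketch, `trend` holds the first five Python and Java values displayed earlier in this document, with shortened column names chosen here for brevity.

```python
import pandas as pd

# First five values of two columns, with shortened (hypothetical) names.
trend = pd.DataFrame({
    "Python": [30, 29, 28, 28, 28],
    "Java": [96, 97, 100, 97, 99],
})

# describe() puts the statistics in rows and the columns of the data in columns.
summary = trend.describe()
print(summary.shape)     # (8, 2)

# .T swaps the axes: one row per data column, one column per statistic.
transposed = summary.T
print(transposed.shape)  # (2, 8)
print(transposed)
```

With many columns, the transposed layout is usually easier to read because each variable occupies a single row.
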

## Groupby Operation

### Convert 'Python Worldwide(%)' to numeric, coercing errors

```python
df['Python Worldwide(%)'] = pd.to_numeric(df['Python Worldwide(%)'], errors='coerce')
```
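
In the dataset above, this column is already `int64`, so the conversion leaves it unchanged. The following sketch uses a hypothetical series containing a non-numeric entry to show what `errors='coerce'` does when parsing fails.

```python
import pandas as pd

# Hypothetical raw values; 'n/a' cannot be parsed as a number.
raw_values = pd.Series(["30", "29", "n/a", "28"])

# errors='coerce' replaces unparseable entries with NaN instead of raising.
numeric = pd.to_numeric(raw_values, errors="coerce")
print(numeric.tolist())      # [30.0, 29.0, nan, 28.0]
print(numeric.isna().sum())  # 1
```
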

### Now perform the groupby operation and calculate the mean

```python
grouped_data = df.groupby('Python Worldwide(%)').mean(numeric_only=True)
print(grouped_data)
```

```text
                     JavaScript Worldwide(%)  Java Worldwide(%)  \
Python Worldwide(%)
20                                 35.500000          33.500000
21                                 43.409091          42.227273
22                                 42.740741          42.333333
23                                 44.666667          45.809524
24                                 52.000000          52.076923
...                                      ...                ...
92                                 44.000000          17.000000
93                                 45.500000          17.500000
94                                 45.000000          17.000000
99                                 45.666667          17.000000
100                                47.000000          19.000000

                     C# Worldwide(%)  PhP Worldwide(%)  Flutter Worldwide(%)  \
Python Worldwide(%)
```
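
Grouping by a value column, as above, produces one group per distinct percentage. As a possible alternative, the sketch below groups by a coarser key, the year taken from the `Month` string. The `trend` frame, its 2005 values, and the derived `Year` column are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical excerpt; the 2005 rows are invented for the example.
trend = pd.DataFrame({
    "Month": ["2004-01", "2004-02", "2005-01", "2005-02"],
    "Python Worldwide(%)": [30, 29, 26, 28],
    "Java Worldwide(%)": [96, 97, 90, 88],
})

# Derive a grouping key: the first four characters of 'Month' are the year.
trend["Year"] = trend["Month"].str[:4]

# One row per year; numeric_only=True skips the non-numeric 'Month' column.
yearly_mean = trend.groupby("Year").mean(numeric_only=True)
print(yearly_mean)
```
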

## Check if a Value is Present

```python
value_present = 100 in df['PhP Worldwide(%)'].values
print(f"Is 100 present in PhP Worldwide(%): {value_present}")
```

```text
Is 100 present in PhP Worldwide(%): True
```
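
The `.values` in the check above matters: applying `in` directly to a Series tests the index labels, not the data. The sketch below demonstrates the difference on a short hypothetical series and shows two other ways to test membership.

```python
import pandas as pd

# Hypothetical values; the default index is 0..4.
php = pd.Series([100, 99, 97, 100, 92], name="PhP Worldwide(%)")

print(100 in php.values)  # True: searches the underlying data
print(100 in php)         # False: searches the index labels 0..4

# Element-wise comparison, then any() to reduce to a single boolean.
has_100 = bool((php == 100).any())

# Positions where the value occurs.
positions = php[php == 100].index.tolist()

# isin() tests membership against several candidate values at once.
matches = int(php.isin([92, 97]).sum())
print(has_100, positions, matches)
```
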


## Access a Row by Index and Check Dependency

```python
row_at_index = df.iloc[10]
print(row_at_index)
```

```text
Month                      2004-11
Python Worldwide(%)             26
JavaScript Worldwide(%)         83
Java Worldwide(%)               90
C# Worldwide(%)                 89
PhP Worldwide(%)                84
Flutter Worldwide(%)             6
React Worldwide(%)               2
Swift Worldwide(%)              10
TypeScript Worldwide(%)          2
Matlab Worldwide(%)             92
Name: 10, dtype: object
```
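
`iloc` selects by integer position. For comparison, the sketch below contrasts it with label-based selection through `loc`, using a hypothetical frame whose index holds month strings rather than the default integers.

```python
import pandas as pd

# Hypothetical frame indexed by month label instead of 0, 1, 2.
trend = pd.DataFrame(
    {"Python": [30, 29, 28], "Java": [96, 97, 100]},
    index=["2004-01", "2004-02", "2004-03"],
)

by_position = trend.iloc[1]       # second row, counted from zero
by_label = trend.loc["2004-02"]   # the row whose index label is '2004-02'

single_value = trend.iloc[1, 0]   # row position 1, column position 0
first_two = trend.iloc[0:2]       # positional slice; the end is exclusive
print(by_position)
print(single_value)
```

With the default `RangeIndex`, positions and labels coincide, which is why `df.iloc[10]` above reports `Name: 10`.
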


## Go to an Index Location and Check Whether a Value Is Dependent or Independent

```python
value_at_index = df.iloc[5]
value_at_index
```

```text
Month                      2004-06
Python Worldwide(%)             27
JavaScript Worldwide(%)         94
Java Worldwide(%)               89
C# Worldwide(%)                 97
PhP Worldwide(%)                97
Flutter Worldwide(%)             7
React Worldwide(%)               5
Swift Worldwide(%)               9
TypeScript Worldwide(%)          2
Matlab Worldwide(%)             87
Name: 5, dtype: object
```
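
The notebook does not show how dependence is checked. One common reading is the distinction between the dependent variable (the target to predict) and the independent variables (the features). Under that assumption, the sketch below splits a row into the two roles. The choice of `Python Worldwide(%)` as the target is hypothetical, as is the three-row `trend` frame.

```python
import pandas as pd

# Small excerpt of the dataset with three of its value columns.
trend = pd.DataFrame({
    "Month": ["2004-01", "2004-02", "2004-03"],
    "Python Worldwide(%)": [30, 29, 28],
    "Java Worldwide(%)": [96, 97, 100],
    "JavaScript Worldwide(%)": [98, 98, 100],
})

# Hypothetical choice: treat one column as the dependent variable (target).
target_column = "Python Worldwide(%)"

y = trend[target_column]                           # dependent variable
X = trend.drop(columns=[target_column, "Month"])   # independent variables

# The same split applied to a single row retrieved by position.
row = trend.iloc[1]
row_target = row[target_column]
row_features = row.drop([target_column, "Month"])
print(row_target)
print(row_features)
```
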