**Instructions:**

1. If you have trouble reading in the dataset directly (most likely because of UCI's or your web browser's SSL settings), download the file to your computer manually and upload it to your Azure project; you can then read it in as a local file.
2. This is a very small dataset. So please do not perform any sampling.
3. Make sure you follow best practices in Data Preparation.
4. As a general rule, all categorical features need to be assumed to be nominal unless you have evidence to the contrary.
5. This is an anonymised dataset. Thus, do not flag any numerical values as outliers regardless of their value for numerical features. As another hint, you won't have to look for outliers in categorical features either. However, you will need to look for some unusual values for both numerical and categorical features. 
6. For this question, you are to set all unusual values to missing values. Also, you are to impute any missing values with the mode for categorical features and with the median for numerical features. If there are multiple modes for a categorical feature, use the mode that comes first alphabetically.
7. For the A2 numerical descriptive feature, you are to discretize it via equal-frequency binning with 3 bins named "low", "medium", and "high", and then use integer encoding for it.
8. For normalization, you are to use standard scaling. You are allowed to use Scikit-Learn's preprocessing submodule for this purpose.
9. The target feature needs to be the last column in the clean data and its name needs to be target.
10. You must perform all your preprocessing steps using Python. For any cleaning steps that you perform via Excel or simple find-and-replace in a text editor or any other language or in any other way, you will receive zero points.
11. It's critical that the final clean data does not need any further processing so that it will work without any issues with any classifier within Scikit-Learn.
12. Round all real-valued columns to 3 decimal places.
13. Once you are done, name your final clean dataset as df_clean (if it's not already named as such).
14. At the end, run each one of the following three lines in three separate code cells for a summary:
    - df_clean.shape
    - df_clean.describe(include='all').round(3) 
    - df_clean.head(5)
15. Save your final clean dataset exactly as "df_clean.csv". Make sure your file has the correct column names (including the target column).
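
The imputation rule in instruction 6 can be sketched on a toy frame. This is a minimal illustration, not the assignment solution: the column names `colour` and `size` and the `'?'` marker for unusual values are assumptions made for the example. The tie between modes is broken explicitly by sorting, rather than by relying on the order in which `mode()` returns its values.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical columns); '?' stands in for an unusual value
toy = pd.DataFrame({
    "colour": ["red", "blue", "?", "blue", "red", None],
    "size": [1.0, 2.0, np.nan, 10.0, 3.0, 4.0],
})

# Step 1: set unusual values to missing
toy["colour"] = toy["colour"].mask(toy["colour"] == "?")

# Step 2: categorical -> mode; 'blue' and 'red' tie, so take the
# alphabetically first one
modes = toy["colour"].mode()
fill_cat = sorted(modes)[0]
toy["colour"] = toy["colour"].fillna(fill_cat)

# Step 3: numerical -> median of the observed (non-missing) values
fill_num = toy["size"].median()
toy["size"] = toy["size"].fillna(fill_num)
```

Each column is filled with its own statistic; filling several columns at once with one value would leak the mode or median of one feature into the others.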

`Tasks to complete:`
- T1:: Setting '?' cells to missing values.
- T2:: Eliminating either the A4 or A5 redundant feature. If you did not, you lost the 10 points. However, we still gave full marks for a correct follow-through.
- T3:: Imputing missing values.
- T4:: Binning numeric feature A2.
- T5:: Integer-encoding of feature A2.
- T6:: One-Hot-Encoding of categorical descriptive features. If only 2 levels, use only one binary variable. If > 2 levels, encode using q binary variables.
- T7:: Standard scaling of all descriptive features.
- T8:: Remapping the target feature with 1 as the positive class without any scaling.
- T9:: Clean data saved as a CSV file with (1) correct number of columns, (2) correct column names, (3) correct column values, and (4) with the target feature as the last column.
- T10:: Running all the 3 lines of code at the end and displaying each one of their outputs.
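
The encoding rule in T6 can be sketched as follows. This is a hedged illustration on made-up data: the column names `flag` and `grade` are hypothetical, and `pd.get_dummies` is only one of several ways to build the indicator columns.

```python
import pandas as pd

# Toy data (hypothetical columns): 'flag' has 2 levels, 'grade' has 3 levels
toy = pd.DataFrame({
    "flag": ["t", "f", "t", "f"],
    "grade": ["g", "s", "p", "g"],
})

cat_cols = ["flag", "grade"]
binary_cols = [c for c in cat_cols if toy[c].nunique() == 2]
multi_cols = [c for c in cat_cols if toy[c].nunique() > 2]

# 2 levels -> a single 0/1 column (drop_first removes the redundant one)
encoded = pd.get_dummies(toy, columns=binary_cols, drop_first=True, dtype=int)

# q > 2 levels -> q indicator columns, one per level
encoded = pd.get_dummies(encoded, columns=multi_cols, dtype=int)
```

Passing `dtype=int` keeps the indicators numeric, so the result can be fed to a classifier without further conversion.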

## Q1 Pre Processing

In [1]:
import pandas as pd
import numpy as np

### T1
**T1::** Setting '?' cells to missing values.

Reading the data and setting `"?"` to NA values

In [2]:
cols = ('A1','A2','A3','A4','A5','A6','A7','A8','A9','A10','A11','A12','A13','A14','A15','A16')
df = pd.read_csv('C:/Users/piyus/OneDrive/Desktop/Brushing/Machine Learning/practice/Ass 1/crx.data',names=cols, sep=',',na_values=['?'])
df.head(4)

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202.0,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43.0,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280.0,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100.0,3,+
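
The effect of `na_values` can be checked in isolation with a small in-memory sketch. The three rows below are made up for illustration and only mimic the comma-separated layout; they are not taken from the dataset.

```python
import io
import pandas as pd

# Three made-up rows with '?' marking unknown cells
raw = "b,30.83,u\n?,24.50,?\na,?,y\n"

# na_values=['?'] turns every '?' cell into NaN while parsing
toy = pd.read_csv(io.StringIO(raw), names=["A1", "A2", "A4"], na_values=["?"])

missing_per_column = toy.isna().sum()
```

Because the `'?'` cells are converted during parsing, the numeric column is read as `float64` instead of falling back to `object`.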


In [3]:
df.dtypes

A1      object
A2     float64
A3     float64
A4      object
A5      object
A6      object
A7      object
A8     float64
A9      object
A10     object
A11      int64
A12     object
A13     object
A14    float64
A15      int64
A16     object
dtype: object

In [4]:
df.isna().sum()

A1     12
A2     12
A3      0
A4      6
A5      6
A6      9
A7      9
A8      0
A9      0
A10     0
A11     0
A12     0
A13     0
A14    13
A15     0
A16     0
dtype: int64

In [5]:
# define categorical and num column lists per the given data types in the data description
cat_cols = ['A1', 'A4', 'A5', 'A6','A7','A9', 'A10', 'A12', 'A13', 'A16']
num_cols = ['A2', 'A3', 'A8', 'A11','A14','A15']

In [6]:
df[cat_cols] = df[cat_cols].astype(object)  # np.object was deprecated in NumPy 1.20 and later removed
df[num_cols] = df[num_cols].astype(float)   # np.number is an abstract type; float gives float64


In [7]:
df.dtypes

A1      object
A2     float64
A3     float64
A4      object
A5      object
A6      object
A7      object
A8     float64
A9      object
A10     object
A11    float64
A12     object
A13     object
A14    float64
A15    float64
A16     object
dtype: object

### T3
**T3::** 10 points: Imputing missing values.

In [8]:
df[330:331]

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
330,b,20.42,0.0,,,,,0.0,f,f,0.0,f,p,,0.0,-


In [9]:
data_imputed = df.copy()

# impute each categorical column with its own mode
for col in cat_cols:
    mode = data_imputed[col].mode()[0]
    print(f'Mode for column {col} is :{mode}')
    data_imputed[col] = data_imputed[col].fillna(mode)

# impute each numerical column with its own median
for col in num_cols:
    median = data_imputed[col].median()
    print(f'Median for column {col} is: {median}')
    data_imputed[col] = data_imputed[col].fillna(median)

Mode for column A1 is :b
Mode for column A4 is :u
Mode for column A5 is :g
Mode for column A6 is :c
Mode for column A7 is :v
Mode for column A9 is :t
Mode for column A10 is :f
Mode for column A12 is :f
Mode for column A13 is :g
Mode for column A16 is :-
Median for column A2 is: 28.46
Median for column A3 is: 2.75
Median for column A8 is: 1.0
Median for column A11 is: 0.0
Median for column A14 is: 160.0
Median for column A15 is: 5.0


In [10]:
data_imputed[330:331]

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
330,b,20.42,0.0,u,g,c,v,0.0,f,f,0.0,f,p,160.0,0.0,-


In [11]:
df = data_imputed.copy()
df[330:331]

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
330,b,20.42,0.0,u,g,c,v,0.0,f,f,0.0,f,p,160.0,0.0,-


#### Eliminating Redundant Feature

We observe that this dataset does not contain any of the following:

* ID-like columns
* Constant features
* Date or time features

However, columns A4 and A5 have the same value counts, which is suspicious: perhaps these two columns contain the same information. Let's map the A5 levels onto the A4 levels and then check whether the two series are equal after the mapping.

### T2
**T2::** Eliminating either the A4 or A5 redundant feature.

In [12]:
for col in cat_cols:
    print(df[col].value_counts(),'\n')

b    480
a    210
Name: A1, dtype: int64 

u    525
y    163
l      2
Name: A4, dtype: int64 

g     525
p     163
gg      2
Name: A5, dtype: int64 

c     146
q      78
w      64
i      59
aa     54
ff     53
k      51
cc     41
x      38
m      38
d      30
e      25
j      10
r       3
Name: A6, dtype: int64 

v     408
h     138
bb     59
ff     57
j       8
z       8
dd      6
n       4
o       2
Name: A7, dtype: int64 

t    361
f    329
Name: A9, dtype: int64 

f    395
t    295
Name: A10, dtype: int64 

f    374
t    316
Name: A12, dtype: int64 

g    625
s     57
p      8
Name: A13, dtype: int64 

-    383
+    307
Name: A16, dtype: int64 



In [13]:
A4 = df['A4'].copy()
A5 = df['A5'].copy()

A5 = A5.replace({'g':'u', 'p':'y', 'gg':'l'})

print(A4.head(10).values)
print(A5.head(10).values)

A4.equals(A5)

['u' 'u' 'u' 'u' 'u' 'u' 'u' 'u' 'y' 'y']
['u' 'u' 'u' 'u' 'u' 'u' 'u' 'u' 'y' 'y']


True
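
As an alternative check that does not require writing the level mapping by hand, a cross-tabulation can reveal whether two columns are relabellings of each other. The sketch below uses two short made-up series (`left` and `right` are hypothetical names), so it only illustrates the idea.

```python
import pandas as pd

# Toy stand-ins for two columns suspected of carrying the same information
left = pd.Series(["u", "u", "y", "l", "y"], name="left")
right = pd.Series(["g", "g", "p", "gg", "p"], name="right")

# Cross-tabulate the level pairs: a one-to-one relabelling leaves exactly
# one non-zero cell in every row and in every column
table = pd.crosstab(left, right)
nonzero = table > 0
one_to_one = bool((nonzero.sum(axis=0) == 1).all()
                  and (nonzero.sum(axis=1) == 1).all())
```

If `one_to_one` is true, either column can be dropped without losing information.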

In [14]:
df = df.drop(columns='A5')

In [15]:
df.columns

Index(['A1', 'A2', 'A3', 'A4', 'A6', 'A7', 'A8', 'A9', 'A10', 'A11', 'A12',
       'A13', 'A14', 'A15', 'A16'],
      dtype='object')

### T4
**T4::** Binning numeric feature A2.

In [17]:
df_cat = df.copy()

In [19]:
df_cat['A2'] = pd.qcut(df_cat['A2'],q=3,labels=['low','medium','high'])

In [32]:
df_cat['A2'].value_counts()

medium    231
low       230
high      229
Name: A2, dtype: int64

In [33]:
df_cat.sample(n=6,random_state=11)

Unnamed: 0,A1,A2,A3,A4,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
214,b,medium,2.71,y,cc,v,5.25,t,t,1.0,f,g,211.0,0.0,+
545,b,high,11.0,y,d,v,1.5,t,f,0.0,f,s,0.0,0.0,-
436,b,low,0.585,u,ff,ff,0.0,f,t,3.0,f,g,350.0,769.0,-
201,a,high,1.0,u,i,bb,2.25,t,f,0.0,t,g,0.0,300.0,+
372,a,high,4.585,u,k,h,1.0,f,f,0.0,t,s,240.0,0.0,-
191,b,high,0.205,u,i,h,5.125,t,f,0.0,f,g,400.0,0.0,+


Now column `A2` is categorical.
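
To see what equal-frequency binning does, here is a minimal sketch on nine made-up values (the series `values` is an assumption for illustration). `pd.qcut` places the bin edges at the sample quantiles, so each bin receives roughly the same number of rows.

```python
import pandas as pd

# Nine made-up values, already sorted for readability
values = pd.Series([18, 22, 25, 31, 36, 40, 47, 52, 60], name="value")

# q=3 cuts at the 1/3 and 2/3 quantiles; labels name the bins in order
binned = pd.qcut(values, q=3, labels=["low", "medium", "high"])
counts = binned.value_counts()
```

With real data the bin sizes can differ slightly, as in the 230/231/229 split above, because the row count need not divide evenly by the number of bins and tied values cannot be split across bins.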

### T5
**T5::** Integer-encoding of feature A2.