# Credit Card Transactions Dataset

The Credit Card Transactions Dataset provides detailed records of credit card transactions, including transaction times, amounts, and associated cardholder and merchant details. The dataset has 1,296,675 rows.

## Data Columns

### 1. trans_date_trans_time
- **Description**: This column contains the date and time of the transaction.
- **Type**: Timestamp or string (depending on how it's formatted).
- **Usage**: Useful for analyzing transaction patterns over time, such as daily/weekly/monthly trends or detecting anomalies during specific periods.

### 2. cc_num
- **Description**: The credit card number associated with the transaction.
- **Type**: Numeric or string (if padded with leading zeros).
- **Usage**: Identifies the specific credit card used in the transaction. Can be used to track transactions made by the same card.

### 3. merchant
- **Description**: The name or identifier of the merchant where the transaction occurred.
- **Type**: String.
- **Usage**: Helps categorize transactions based on the merchant and can be useful for fraud detection if certain merchants are flagged as high-risk.

### 4. category
- **Description**: The category of the merchant or the type of purchase (e.g., "grocery", "travel", "entertainment").
- **Type**: Categorical (string).
- **Usage**: Provides insight into the nature of the transaction and can help identify unusual spending behavior.

### 5. amt
- **Description**: The amount of money involved in the transaction.
- **Type**: Numeric (float or integer).
- **Usage**: Critical for fraud detection, as unusually large or small amounts might indicate suspicious activity.

### 6. first
- **Description**: The first name of the cardholder.
- **Type**: String.
- **Usage**: Can be used for identity verification or linking transactions to a specific individual.

### 7. last
- **Description**: The last name of the cardholder.
- **Type**: String.
- **Usage**: Similar to `first`, this helps identify the cardholder and link transactions to an individual.

### 8. gender
- **Description**: The gender of the cardholder.
- **Type**: Categorical (string, e.g., "M" for male, "F" for female).
- **Usage**: May be used for demographic analysis or to identify patterns in spending behavior based on gender.

### 9. street
- **Description**: The street address of the cardholder.
- **Type**: String.
- **Usage**: Useful for verifying the cardholder's location and detecting potential fraud if transactions occur far from the cardholder's home address.

### 10. city
- **Description**: The city where the cardholder resides.
- **Type**: String.
- **Usage**: Similar to `street`, this can help verify the cardholder's location and detect anomalies.

### 11. state
- **Description**: The state or province where the cardholder resides.
- **Type**: String.
- **Usage**: Another geographical indicator that can help in fraud detection.

### 12. zip
- **Description**: The ZIP code of the cardholder's residence.
- **Type**: Numeric or string.
- **Usage**: Useful for geolocation and detecting transactions that occur outside the cardholder's usual area.
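
ZIP codes that begin with zero lose the leading digit when they are parsed as integers (the summary statistics later in this notebook report a minimum `zip` of 1257). A minimal sketch of reading the column as a string instead, assuming a small made-up CSV snippet rather than the real file:

```python
import io
import pandas as pd

# Hypothetical two-row sample; the real file has many more columns.
sample_csv = io.StringIO("zip,amt\n01257,12.50\n28654,4.97\n")

# Default parsing infers an integer dtype and drops the leading zero.
as_int = pd.read_csv(sample_csv)
print(as_int["zip"].tolist())

# Passing dtype=str for the column keeps the code exactly as written in the file.
sample_csv.seek(0)
as_str = pd.read_csv(sample_csv, dtype={"zip": str})
print(as_str["zip"].tolist())

# If the data is already loaded as integers, zero-pad back to five characters.
restored = as_int["zip"].astype(str).str.zfill(5)
print(restored.tolist())
```

The same `dtype` argument could be applied to `cc_num`, which is an identifier rather than a quantity.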

### 13. lat
- **Description**: The latitude coordinate of the cardholder's location.
- **Type**: Numeric (float).
- **Usage**: Used for geospatial analysis to determine the proximity of the transaction location to the cardholder's home.

### 14. long
- **Description**: The longitude coordinate of the cardholder's location.
- **Type**: Numeric (float).
- **Usage**: Complements `lat` for geospatial analysis.

### 15. city_pop
- **Description**: The population of the city where the cardholder resides.
- **Type**: Numeric (integer).
- **Usage**: May be used for demographic analysis or to identify patterns in spending behavior based on urban vs. rural areas.

### 16. job
- **Description**: The occupation or job title of the cardholder.
- **Type**: String.
- **Usage**: Useful for understanding the cardholder's income level or spending habits based on their profession.

### 17. dob
- **Description**: The date of birth of the cardholder.
- **Type**: Date or string.
- **Usage**: Helps calculate the cardholder's age, which can be used for demographic analysis or to identify age-related spending patterns.

### 18. trans_num
- **Description**: A unique identifier for each transaction.
- **Type**: String or numeric.
- **Usage**: Ensures each transaction can be uniquely identified and tracked.

### 19. unix_time
- **Description**: The timestamp of the transaction in Unix time format (number of seconds since January 1, 1970).
- **Type**: Numeric (integer).
- **Usage**: Useful for precise time-based analysis and comparisons across different systems.
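
As a rough illustration of the conversion, the sketch below turns Unix timestamps into pandas `Timestamp` values. The inputs are the `unix_time` values of the first two rows shown later in this notebook. They decode to dates in 2012 while the same rows' `trans_date_trans_time` reads 2019, so the two columns appear to be offset and are worth cross-checking before they are mixed.

```python
import pandas as pd

# unix_time of the first two rows in the data preview (seconds since 1970-01-01 UTC).
unix_time = pd.Series([1325376018, 1325376044])

# unit="s" tells pandas the integers are seconds rather than nanoseconds.
converted = pd.to_datetime(unix_time, unit="s")
print(converted.iloc[0])  # 2012-01-01 00:00:18
```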

### 20. merch_lat
- **Description**: The latitude coordinate of the merchant's location.
- **Type**: Numeric (float).
- **Usage**: Used for geospatial analysis to determine the distance between the cardholder and the merchant.

### 21. merch_long
- **Description**: The longitude coordinate of the merchant's location.
- **Type**: Numeric (float).
- **Usage**: Complements `merch_lat` for geospatial analysis.
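
The distance between the cardholder and the merchant mentioned above can be approximated with the haversine formula. The following is a sketch only: `haversine_km` and `distance_km` are illustrative names that do not exist in the dataset, and the Earth is treated as a sphere of radius 6371 km.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    earth_radius_km = 6371.0  # mean Earth radius; an approximation
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * earth_radius_km * np.arcsin(np.sqrt(a))

# Coordinates copied from the first row of the data preview.
sample = pd.DataFrame({
    "lat": [36.0788], "long": [-81.1781],
    "merch_lat": [36.011293], "merch_long": [-82.048315],
})

# The function is vectorised, so whole columns can be passed at once.
sample["distance_km"] = haversine_km(
    sample["lat"], sample["long"], sample["merch_lat"], sample["merch_long"]
)
print(sample["distance_km"])
```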

### 22. is_fraud
- **Description**: Indicates whether the transaction is fraudulent (1) or legitimate (0).
- **Type**: Binary (integer, 0 or 1).
- **Usage**: The target variable for fraud detection models. Used to train machine learning algorithms to predict fraud.

### 23. merch_zipcode
- **Description**: The ZIP code of the merchant's location.
- **Type**: Numeric or string.
- **Usage**: Useful for comparing the merchant's location with the cardholder's location to detect potential fraud.

## Summary

This dataset contains a mix of transactional, demographic, and geospatial data. It can be used for various purposes, such as:
- Fraud detection: Analyzing patterns and anomalies in transactions.
- Customer segmentation: Grouping customers based on their spending behavior, location, or demographics.
- Risk assessment: Identifying high-risk merchants or regions.

# import needed libraries

In [2]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

In [3]:
pd.set_option('display.max_columns',25)

# read the data

In [4]:
df=pd.read_csv('credit_card_transactions.csv')
df

Unnamed: 0.1,Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,city,state,zip,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud,merch_zipcode
0,0,2019-01-01 00:00:18,2703186189652095,"fraud_Rippin, Kub and Mann",misc_net,4.97,Jennifer,Banks,F,561 Perry Cove,Moravian Falls,NC,28654,36.0788,-81.1781,3495,"Psychologist, counselling",1988-03-09,0b242abb623afc578575680df30655b9,1325376018,36.011293,-82.048315,0,28705.0
1,1,2019-01-01 00:00:44,630423337322,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,Stephanie,Gill,F,43039 Riley Greens Suite 393,Orient,WA,99160,48.8878,-118.2105,149,Special educational needs teacher,1978-06-21,1f76529f8574734946361c461b024d99,1325376044,49.159047,-118.186462,0,
2,2,2019-01-01 00:00:51,38859492057661,fraud_Lind-Buckridge,entertainment,220.11,Edward,Sanchez,M,594 White Dale Suite 530,Malad City,ID,83252,42.1808,-112.2620,4154,Nature conservation officer,1962-01-19,a1a22d70485983eac12b5b88dad1cf95,1325376051,43.150704,-112.154481,0,83236.0
3,3,2019-01-01 00:01:16,3534093764340240,"fraud_Kutch, Hermiston and Farrell",gas_transport,45.00,Jeremy,White,M,9443 Cynthia Court Apt. 038,Boulder,MT,59632,46.2306,-112.1138,1939,Patent attorney,1967-01-12,6b849c168bdad6f867558c3793159a81,1325376076,47.034331,-112.561071,0,
4,4,2019-01-01 00:03:06,375534208663984,fraud_Keeling-Crist,misc_pos,41.96,Tyler,Garcia,M,408 Bradley Rest,Doe Hill,VA,24433,38.4207,-79.4629,99,Dance movement psychotherapist,1986-03-28,a41d7549acf90789359a9aa5346dcb46,1325376186,38.674999,-78.632459,0,22844.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1296670,1296670,2020-06-21 12:12:08,30263540414123,fraud_Reichel Inc,entertainment,15.56,Erik,Patterson,M,162 Jessica Row Apt. 072,Hatch,UT,84735,37.7175,-112.4777,258,Geoscientist,1961-11-24,440b587732da4dc1a6395aba5fb41669,1371816728,36.841266,-111.690765,0,
1296671,1296671,2020-06-21 12:12:19,6011149206456997,fraud_Abernathy and Sons,food_dining,51.70,Jeffrey,White,M,8617 Holmes Terrace Suite 651,Tuscarora,MD,21790,39.2667,-77.5101,100,"Production assistant, television",1979-12-11,278000d2e0d2277d1de2f890067dcc0a,1371816739,38.906881,-78.246528,0,22630.0
1296672,1296672,2020-06-21 12:12:32,3514865930894695,fraud_Stiedemann Ltd,food_dining,105.93,Christopher,Castaneda,M,1632 Cohen Drive Suite 639,High Rolls Mountain Park,NM,88325,32.9396,-105.8189,899,Naval architect,1967-08-30,483f52fe67fabef353d552c1e662974c,1371816752,33.619513,-105.130529,0,88351.0
1296673,1296673,2020-06-21 12:13:36,2720012583106919,"fraud_Reinger, Weissnat and Strosin",food_dining,74.90,Joseph,Murray,M,42933 Ryan Underpass,Manderson,SD,57756,43.3526,-102.5411,1126,Volunteer coordinator,1980-08-18,d667cdcbadaaed3da3f4020e83591c83,1371816816,42.788940,-103.241160,0,69367.0


# build the needed function

In [33]:
def cat(feature):
    '''Return the number of unique values of a categorical feature and the values themselves.'''
    unique_values = df[feature].unique()
    return len(unique_values), unique_values

# explore the data

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296675 entries, 0 to 1296674
Data columns (total 24 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   Unnamed: 0             1296675 non-null  int64  
 1   trans_date_trans_time  1296675 non-null  object 
 2   cc_num                 1296675 non-null  int64  
 3   merchant               1296675 non-null  object 
 4   category               1296675 non-null  object 
 5   amt                    1296675 non-null  float64
 6   first                  1296675 non-null  object 
 7   last                   1296675 non-null  object 
 8   gender                 1296675 non-null  object 
 9   street                 1296675 non-null  object 
 10  city                   1296675 non-null  object 
 11  state                  1296675 non-null  object 
 12  zip                    1296675 non-null  int64  
 13  lat                    1296675 non-null  float64
 14  long              

In [5]:
df.duplicated().sum()

0

## Explore numeric features

In [7]:
num_columns = df.select_dtypes(include=['number']).columns.tolist()
num_columns

['Unnamed: 0',
 'cc_num',
 'amt',
 'zip',
 'lat',
 'long',
 'city_pop',
 'unix_time',
 'merch_lat',
 'merch_long',
 'is_fraud',
 'merch_zipcode']

In [8]:
df.describe()

Unnamed: 0.1,Unnamed: 0,cc_num,amt,zip,lat,long,city_pop,unix_time,merch_lat,merch_long,is_fraud,merch_zipcode
count,1296675.0,1296675.0,1296675.0,1296675.0,1296675.0,1296675.0,1296675.0,1296675.0,1296675.0,1296675.0,1296675.0,1100702.0
mean,648337.0,4.17192e+17,70.35104,48800.67,38.53762,-90.22634,88824.44,1349244000.0,38.53734,-90.22646,0.005788652,46825.75
std,374318.0,1.308806e+18,160.316,26893.22,5.075808,13.75908,301956.4,12841280.0,5.109788,13.77109,0.07586269,25834.0
min,0.0,60416210000.0,1.0,1257.0,20.0271,-165.6723,23.0,1325376000.0,19.02779,-166.6712,0.0,1001.0
25%,324168.5,180042900000000.0,9.65,26237.0,34.6205,-96.798,743.0,1338751000.0,34.73357,-96.89728,0.0,25114.0
50%,648337.0,3521417000000000.0,47.52,48174.0,39.3543,-87.4769,2456.0,1349250000.0,39.36568,-87.43839,0.0,45860.0
75%,972505.5,4642255000000000.0,83.14,72042.0,41.9404,-80.158,20328.0,1359385000.0,41.95716,-80.2368,0.0,68319.0
max,1296674.0,4.992346e+18,28948.9,99783.0,66.6933,-67.9503,2906700.0,1371817000.0,67.51027,-66.9509,1.0,99403.0
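
The `is_fraud` mean of about 0.0058 in the summary above means that roughly 0.58% of transactions are labelled fraudulent, so the two classes are heavily imbalanced. A small sketch of checking the class balance directly, using a made-up frame (`toy`) in place of `df`:

```python
import pandas as pd

# Hypothetical stand-in for df: 1 fraudulent row out of 200.
toy = pd.DataFrame({"is_fraud": [0] * 199 + [1]})

# normalize=True returns proportions instead of counts.
class_share = toy["is_fraud"].value_counts(normalize=True)

# For a 0/1 column the mean equals the share of positive rows.
fraud_rate = toy["is_fraud"].mean()

print(class_share)
print(f"fraud rate: {fraud_rate:.2%}")
```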


## Explore categorical features

In [9]:
cat_columns = df.select_dtypes(include=['object']).columns.tolist()
cat_columns

['trans_date_trans_time',
 'merchant',
 'category',
 'first',
 'last',
 'gender',
 'street',
 'city',
 'state',
 'job',
 'dob',
 'trans_num']

In [10]:
for feature in cat_columns:
    print('='*50)
    print(feature)
    print(cat(feature))

trans_date_trans_time
(1274791, array(['2019-01-01 00:00:18', '2019-01-01 00:00:44',
       '2019-01-01 00:00:51', ..., '2020-06-21 12:12:32',
       '2020-06-21 12:13:36', '2020-06-21 12:13:37'], dtype=object))
merchant
(693, array(['fraud_Rippin, Kub and Mann', 'fraud_Heller, Gutmann and Zieme',
       'fraud_Lind-Buckridge', 'fraud_Kutch, Hermiston and Farrell',
       'fraud_Keeling-Crist', 'fraud_Stroman, Hudson and Erdman',
       'fraud_Rowe-Vandervort', 'fraud_Corwin-Collins',
       'fraud_Herzog Ltd', 'fraud_Schoen, Kuphal and Nitzsche',
       'fraud_Rutherford-Mertz', 'fraud_Kerluke-Abshire',
       'fraud_Lockman Ltd', 'fraud_Kiehn Inc', 'fraud_Beier-Hyatt',
       'fraud_Schmidt and Sons', 'fraud_Lebsack and Sons',
       'fraud_Mayert Group', 'fraud_Konopelski, Schneider and Hartmann',
       'fraud_Schultz, Simonis and Little', 'fraud_Bauch-Raynor',
       'fraud_Harris Inc', 'fraud_Kling-Grant', 'fraud_Pacocha-Bauch',
       'fraud_Lesch Ltd', 'fraud_Kunde-Sanford', "f

# dealing with NaN

In [11]:
df.isna().sum()

Unnamed: 0                    0
trans_date_trans_time         0
cc_num                        0
merchant                      0
category                      0
amt                           0
first                         0
last                          0
gender                        0
street                        0
city                          0
state                         0
zip                           0
lat                           0
long                          0
city_pop                      0
job                           0
dob                           0
trans_num                     0
unix_time                     0
merch_lat                     0
merch_long                    0
is_fraud                      0
merch_zipcode            195973
dtype: int64

### rows with a missing merch_zipcode will be removed
About `15%` of the rows have no `merch_zipcode`. Because the column is a categorical code rather than a quantity, it is not imputed; the rows where it is NaN are dropped instead.
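
The 15% figure follows from the counts printed above: 195,973 missing values out of 1,296,675 rows. The sketch below reproduces that arithmetic and shows one way to compute the missing share per column, on a made-up frame (`toy`) standing in for `df`:

```python
import pandas as pd

# Arithmetic behind the 15% figure, using the counts printed by df.isna().sum().
missing, total = 195_973, 1_296_675
print(f"{missing / total:.2%}")

# Hypothetical stand-in for df with one missing ZIP code out of four rows.
toy = pd.DataFrame({
    "amt": [4.97, 107.23, 220.11, 45.00],
    "merch_zipcode": [28705.0, None, 83236.0, 22844.0],
})

# isna().mean() gives the fraction of missing values in each column.
null_share = toy.isna().mean()
print(null_share)

# Keep only the rows that have a merchant ZIP code.
kept = toy.dropna(subset=["merch_zipcode"])
print(len(kept))
```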

# drop noise columns

In [6]:
df.drop(columns=['Unnamed: 0', 'first', 'last'], inplace=True)

In [7]:
# Drop rows where merch_zipcode is NaN
df = df.dropna(subset=['merch_zipcode'])

In [8]:
df.isna().sum() 

trans_date_trans_time    0
cc_num                   0
merchant                 0
category                 0
amt                      0
gender                   0
street                   0
city                     0
state                    0
zip                      0
lat                      0
long                     0
city_pop                 0
job                      0
dob                      0
trans_num                0
unix_time                0
merch_lat                0
merch_long               0
is_fraud                 0
merch_zipcode            0
dtype: int64

In [8]:
len(df.columns)

21

In [9]:
df.columns

Index(['trans_date_trans_time', 'cc_num', 'merchant', 'category', 'amt',
       'gender', 'street', 'city', 'state', 'zip', 'lat', 'long', 'city_pop',
       'job', 'dob', 'trans_num', 'unix_time', 'merch_lat', 'merch_long',
       'is_fraud', 'merch_zipcode'],
      dtype='object')

In [9]:
df['trans_date_trans_time']=pd.to_datetime(df['trans_date_trans_time'])
df['dob']=pd.to_datetime(df['dob'])


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['trans_date_trans_time']=pd.to_datetime(df['trans_date_trans_time'])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['dob']=pd.to_datetime(df['dob'])
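
The warning above appears because `df` was produced by `dropna`, and pandas cannot tell whether the later assignment is meant to modify the original frame. One common way to avoid it, sketched here on a made-up frame (`raw` and `clean` are illustrative names), is to take an explicit copy right after filtering:

```python
import pandas as pd

# Hypothetical stand-in for the raw data.
raw = pd.DataFrame({
    "trans_date_trans_time": ["2019-01-01 00:00:18", "2019-01-01 00:00:44"],
    "merch_zipcode": [28705.0, None],
})

# .copy() makes the filtered frame independent of `raw`,
# so later column assignments are unambiguous.
clean = raw.dropna(subset=["merch_zipcode"]).copy()
clean["trans_date_trans_time"] = pd.to_datetime(clean["trans_date_trans_time"])
print(clean.dtypes)
```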


# feature engineering

In [10]:
df['age'] = ((df['trans_date_trans_time'] - df['dob']).dt.days / 365.25).astype(int)


In [11]:
df['year']=pd.to_datetime(df['trans_date_trans_time']).dt.year


In [12]:
df['month']=pd.to_datetime(df['trans_date_trans_time']).dt.month


In [13]:
df['day']=pd.to_datetime(df['trans_date_trans_time']).dt.day


In [14]:
df['hour']=pd.to_datetime(df['trans_date_trans_time']).dt.hour


In [15]:
df

Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,gender,street,city,state,zip,lat,long,...,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud,merch_zipcode,age,year,month,day,hour
0,2019-01-01 00:00:18,2703186189652095,"fraud_Rippin, Kub and Mann",misc_net,4.97,F,561 Perry Cove,Moravian Falls,NC,28654,36.0788,-81.1781,...,1988-03-09,0b242abb623afc578575680df30655b9,1325376018,36.011293,-82.048315,0,28705.0,30,2019,1,1,0
2,2019-01-01 00:00:51,38859492057661,fraud_Lind-Buckridge,entertainment,220.11,M,594 White Dale Suite 530,Malad City,ID,83252,42.1808,-112.2620,...,1962-01-19,a1a22d70485983eac12b5b88dad1cf95,1325376051,43.150704,-112.154481,0,83236.0,56,2019,1,1,0
4,2019-01-01 00:03:06,375534208663984,fraud_Keeling-Crist,misc_pos,41.96,M,408 Bradley Rest,Doe Hill,VA,24433,38.4207,-79.4629,...,1986-03-28,a41d7549acf90789359a9aa5346dcb46,1325376186,38.674999,-78.632459,0,22844.0,32,2019,1,1,0
5,2019-01-01 00:04:08,4767265376804500,"fraud_Stroman, Hudson and Erdman",gas_transport,94.63,F,4655 David Island,Dublin,PA,18917,40.3750,-75.2045,...,1961-06-19,189a841a0a8ba03058526bcfe566aab5,1325376248,40.653382,-76.152667,0,17972.0,57,2019,1,1,0
7,2019-01-01 00:05:08,6011360759745864,fraud_Corwin-Collins,gas_transport,71.65,M,231 Flores Pass Suite 720,Edinburg,VA,22824,38.8432,-78.6003,...,1947-08-21,6d294ed2cc447d2c71c7171a3d54967c,1325376308,38.948089,-78.540296,0,22644.0,71,2019,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1296669,2020-06-21 12:11:36,4400011257587661852,fraud_Stiedemann Inc,misc_pos,37.38,F,474 Allen Haven,North Loup,NE,68859,41.4972,-98.7858,...,1980-09-15,9a7ea2625cf8303efe34e3c09546868f,1371816696,41.728638,-99.039660,0,68837.0,39,2020,6,21,12
1296671,2020-06-21 12:12:19,6011149206456997,fraud_Abernathy and Sons,food_dining,51.70,M,8617 Holmes Terrace Suite 651,Tuscarora,MD,21790,39.2667,-77.5101,...,1979-12-11,278000d2e0d2277d1de2f890067dcc0a,1371816739,38.906881,-78.246528,0,22630.0,40,2020,6,21,12
1296672,2020-06-21 12:12:32,3514865930894695,fraud_Stiedemann Ltd,food_dining,105.93,M,1632 Cohen Drive Suite 639,High Rolls Mountain Park,NM,88325,32.9396,-105.8189,...,1967-08-30,483f52fe67fabef353d552c1e662974c,1371816752,33.619513,-105.130529,0,88351.0,52,2020,6,21,12
1296673,2020-06-21 12:13:36,2720012583106919,"fraud_Reinger, Weissnat and Strosin",food_dining,74.90,M,42933 Ryan Underpass,Manderson,SD,57756,43.3526,-102.5411,...,1980-08-18,d667cdcbadaaed3da3f4020e83591c83,1371816816,42.788940,-103.241160,0,69367.0,39,2020,6,21,12


# Encoding

In [16]:
df['state'].unique()

array(['NC', 'ID', 'VA', 'PA', 'TN', 'IA', 'WV', 'FL', 'NJ', 'OK', 'IN',
       'MA', 'TX', 'WI', 'MI', 'WY', 'HI', 'LA', 'DC', 'KY', 'NY', 'MS',
       'KS', 'AL', 'WA', 'AR', 'MD', 'GA', 'ME', 'CA', 'NE', 'MN', 'OH',
       'VT', 'MO', 'SC', 'OR', 'IL', 'NH', 'CO', 'SD', 'MT', 'ND', 'CT',
       'AZ', 'UT', 'NM', 'NV', 'RI', 'DE'], dtype=object)

In [17]:
category_mapping = {
    'misc_net': 'misc',
    'misc_pos': 'misc',
    'grocery_pos': 'grocery',
    'grocery_net': 'grocery',
    'shopping_net': 'shopping',
    'shopping_pos': 'shopping',
    'food_dining': 'travel',
    'personal_care': 'travel',
    'health_fitness': 'travel',
    'kids_pets': 'travel',
    'travel': 'travel',
    'home': 'home',
    'entertainment': 'misc', 
    'gas_transport': 'misc'  
}
df['category'] = df['category'].map(category_mapping)



In [18]:
# encoding gender col
encoder=LabelEncoder()
encoded=encoder.fit_transform(df['gender'])
df['gender']=encoded


In [19]:
encoded_cat=encoder.fit_transform(df['category'])
df['category']=encoded_cat


In [20]:
state_means = df.groupby('state')['is_fraud'].mean()
df['state'] = df['state'].map(state_means)


In [21]:
# Per-job fraud rate, used to replace the job title with a numeric value
job_means = df.groupby('job')['is_fraud'].mean()
df['job'] = df['job'].map(job_means)
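
The two cells above replace `state` and `job` with the mean of `is_fraud` within each group, often called mean (target) encoding. Because the means are computed on the full dataset, the encoded columns carry information from the label itself. If a model is later evaluated on held-out rows, one option is to compute the means on the training rows only. A minimal sketch under that assumption, with a made-up frame and a hypothetical split (`toy`, `train`, `test`, and `state_enc` are illustrative names):

```python
import pandas as pd

# Hypothetical stand-in data; the train/test split below is illustrative only.
toy = pd.DataFrame({
    "state": ["NC", "NC", "VA", "VA", "NC", "WA"],
    "is_fraud": [0, 1, 0, 0, 1, 0],
})
train, test = toy.iloc[:4].copy(), toy.iloc[4:].copy()

# Learn the per-state fraud rate from the training rows only.
state_means = train.groupby("state")["is_fraud"].mean()
global_mean = train["is_fraud"].mean()

# Apply the learned mapping; states unseen in training fall back to the global mean.
train["state_enc"] = train["state"].map(state_means)
test["state_enc"] = test["state"].map(state_means).fillna(global_mean)
print(test)
```

Falling back to the global mean for unseen groups is one design choice among several; smoothing the per-group means toward the global mean is another.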


In [22]:
df.drop(columns=['trans_date_trans_time','dob'],inplace=True)


In [50]:
df

Unnamed: 0,category,amt,gender,state,city_pop,job,is_fraud,age,year,month,day,hour
0,2,4.97,0,0.004923,3495,0.001693,0,30,2019,1,1,0
1,0,107.23,0,0.005073,149,0.002157,0,40,2019,1,1,0
2,2,220.11,1,0.001984,4154,0.015656,0,56,2019,1,1,0
3,2,45.00,1,0.002722,1939,0.007905,0,51,2019,1,1,0
4,2,41.96,1,0.006769,99,0.000000,0,32,2019,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
1296670,2,15.56,1,0.005701,258,0.005338,0,58,2020,6,21,12
1296671,4,51.70,1,0.005994,100,0.015066,0,40,2020,6,21,12
1296672,4,105.93,1,0.004998,899,0.006103,0,52,2020,6,21,12
1296673,4,74.90,1,0.006005,1126,0.003953,0,39,2020,6,21,12


In [25]:
df.to_csv('cleaned.csv',index=False)

In [24]:
df

Unnamed: 0,cc_num,merchant,category,amt,gender,street,city,state,zip,lat,long,city_pop,job,trans_num,unix_time,merch_lat,merch_long,is_fraud,merch_zipcode,age,year,month,day,hour
0,2703186189652095,"fraud_Rippin, Kub and Mann",2,4.97,0,561 Perry Cove,Moravian Falls,0.004781,28654,36.0788,-81.1781,3495,0.001696,0b242abb623afc578575680df30655b9,1325376018,36.011293,-82.048315,0,28705.0,30,2019,1,1,0
2,38859492057661,fraud_Lind-Buckridge,2,220.11,1,594 White Dale Suite 530,Malad City,0.002517,83252,42.1808,-112.2620,4154,0.011834,a1a22d70485983eac12b5b88dad1cf95,1325376051,43.150704,-112.154481,0,83236.0,56,2019,1,1,0
4,375534208663984,fraud_Keeling-Crist,2,41.96,1,408 Bradley Rest,Doe Hill,0.006732,24433,38.4207,-79.4629,99,0.000000,a41d7549acf90789359a9aa5346dcb46,1325376186,38.674999,-78.632459,0,22844.0,32,2019,1,1,0
5,4767265376804500,"fraud_Stroman, Hudson and Erdman",2,94.63,0,4655 David Island,Dublin,0.005638,18917,40.3750,-75.2045,2158,0.011686,189a841a0a8ba03058526bcfe566aab5,1325376248,40.653382,-76.152667,0,17972.0,57,2019,1,1,0
7,6011360759745864,fraud_Corwin-Collins,2,71.65,1,231 Flores Pass Suite 720,Edinburg,0.006732,22824,38.8432,-78.6003,6018,0.012381,6d294ed2cc447d2c71c7171a3d54967c,1325376308,38.948089,-78.540296,0,22644.0,71,2019,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1296669,4400011257587661852,fraud_Stiedemann Inc,2,37.38,0,474 Allen Haven,North Loup,0.007497,68859,41.4972,-98.7858,509,0.014447,9a7ea2625cf8303efe34e3c09546868f,1371816696,41.728638,-99.039660,0,68837.0,39,2020,6,21,12
1296671,6011149206456997,fraud_Abernathy and Sons,4,51.70,1,8617 Holmes Terrace Suite 651,Tuscarora,0.005960,21790,39.2667,-77.5101,100,0.015066,278000d2e0d2277d1de2f890067dcc0a,1371816739,38.906881,-78.246528,0,22630.0,40,2020,6,21,12
1296672,3514865930894695,fraud_Stiedemann Ltd,4,105.93,1,1632 Cohen Drive Suite 639,High Rolls Mountain Park,0.005325,88325,32.9396,-105.8189,899,0.005601,483f52fe67fabef353d552c1e662974c,1371816752,33.619513,-105.130529,0,88351.0,52,2020,6,21,12
1296673,2720012583106919,"fraud_Reinger, Weissnat and Strosin",4,74.90,1,42933 Ryan Underpass,Manderson,0.006285,57756,43.3526,-102.5411,1126,0.004062,d667cdcbadaaed3da3f4020e83591c83,1371816816,42.788940,-103.241160,0,69367.0,39,2020,6,21,12
