## Profitable App Profiles for the App Store and Google Play Markets

First Guided Project on dataquest.io - Data Analyst in Python.<br>
Exploring two CSV data sets:<br>
* **AppleStore.csv** - containing data about approximately seven thousand iOS apps from the App Store.<br> You can download the data set directly from this [link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv). [Data set description](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)
* **googleplaystore.csv** - containing data about approximately ten thousand Android apps from Google Play.<br> You can download the data set directly from this [link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv). [Data set description](https://www.kaggle.com/lava18/google-play-store-apps)


## Opening the CSV files

In [None]:
import csv
import os

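def opening_csv_files(file_path):
    """Read a CSV file and return its header row and its data rows separately."""
    # newline='' lets the csv module handle line endings inside quoted fields itself
    with open(file_path, encoding='utf-8', newline='') as csv_file:
        data_set = list(csv.reader(csv_file))

    data_header = data_set[0]
    data_body = data_set[1:]

    return data_header, data_body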

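# AppleStore data set
# os.path.join avoids the invalid '\A' and '\g' escape sequences and works on every OS
ios_header, ios_dataset = opening_csv_files(os.path.join('src', 'AppleStore.csv'))

# Google Play data set
android_header, android_dataset = opening_csv_files(os.path.join('src', 'googleplaystore.csv'))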

# Check headers
print('Apple Store header:\n{}\n'.format(ios_header))
print('Google Play header:\n{}\n'.format(android_header))

# Length of data sets
print('# of rows in Apple Store data set: {}'.format(len(ios_dataset)))
print('# of rows in Google Play data set: {}'.format(len(android_dataset)))

## Exploring data sets
* Print some rows
* Print # of rows
* Print # of columns

In [None]:
def explore_data(dataset, slice_start, slice_end, rows_columns_count=False):
    """
    dataset: list
    slice_start: starting row of a data slice
    slice_end: ending row of a data slice
    rows_columns_count: if True prints the number of rows and columns
    """
    if slice_start > slice_end:
        slice_end = slice_start + 1
    
    slice_of_data = dataset[slice_start:slice_end]
    
    for row in slice_of_data:
        print(row)
        print('='*100)
        
    if rows_columns_count:
        print('# of rows: {}'.format(len(dataset)))
        print('# of columns: {}'.format(len(dataset[0])))
        
print('Quick view of Apple Store data set: \n' + '-'*50 )
print(ios_header)
print('\n')
explore_data(ios_dataset, 0, 3, True)

print('\n')
print('Quick view of Google Play data set: \n' + '-'*50 )
print(android_header)
print('\n')
explore_data(android_dataset, 0, 3, True)

## Deleting wrong data
The Google Play data set has a [dedicated discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and one of the [discussions outlines an error for row 10472](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015). To catch this kind of error, the number of fields in every entry is compared with the number of header columns. Entries that do not match are written to a log file and removed from the data set.
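
The length check can be illustrated on toy data. The following is a minimal sketch, assuming made-up rows and a hypothetical helper named `split_by_length` that is not part of this notebook; it only shows how rows are separated by comparing their field count with the header.

```python
def split_by_length(rows, header):
    """Partition rows into those matching the header length and those that do not."""
    expected = len(header)
    valid = [row for row in rows if len(row) == expected]
    malformed = [row for row in rows if len(row) != expected]
    return valid, malformed


# Hypothetical toy data: the second row is missing one field
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.5'],
]

valid_rows, malformed_rows = split_by_length(toy_rows, toy_header)
print('valid: {}, malformed: {}'.format(len(valid_rows), len(malformed_rows)))
```

Building new lists instead of deleting while iterating keeps the original data untouched, which can be convenient when the malformed rows need to be inspected afterwards.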

In [None]:
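def len_check_and_log(dataset, dataset_id, header_row):
    # Walk the indices backwards so that deleting a row does not shift
    # the rows that are still to be checked
    for index in range(len(dataset) - 1, -1, -1):
        row = dataset[index]
        if len(row) != len(header_row):
            # add entry to log file
            error_log(error_data=row, dataset_id=dataset_id, header_row=header_row)
            print('Data length error in dataset - {}:\n{}'.format(dataset_id, row))
            # `del row` would only unbind the loop variable; remove the row from the list itself
            del dataset[index]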
            
def error_log(error_data, dataset_id, header_row):
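    if dataset_id == 'android':
        log_entry_file = 'src/android_error_log.csv'
    elif dataset_id == 'ios':
        log_entry_file = 'src/ios_error_log.csv'
    else:
        # Fail early instead of hitting an undefined log_entry_file below
        raise ValueError('Unknown dataset_id: {}'.format(dataset_id))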
    
    # Check if log file exists, if not create the file and add header
    if not os.path.exists(log_entry_file):
        with open(log_entry_file, 'w', newline='', encoding='utf-8') as log_file:
            writer = csv.writer(log_file)
            writer.writerow(header_row)

    # Add wrong line to log file (utf-8 so that non-ASCII app names can be written)
    with open(log_entry_file, 'a', newline='', encoding='utf-8') as log_file:
        log_entries = csv.writer(log_file)
        log_entries.writerow(error_data)
        
len_check_and_log(android_dataset, 'android', android_header)
len_check_and_log(ios_dataset, 'ios', ios_header)

## Checking for duplicated entries
The Google Play data set is checked for duplicated entries by app name.
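
One way to find duplicated names is to count how often each name occurs. The following is a minimal sketch on made-up rows, assuming the app name sits in the first column (index 0); `find_duplicate_names` and `name_index` are hypothetical names introduced for illustration, and the index should be adjusted if the name column is elsewhere.

```python
from collections import Counter


def find_duplicate_names(rows, name_index=0):
    """Return a dict mapping each name that occurs more than once to its count."""
    counts = Counter(row[name_index] for row in rows)
    return {name: count for name, count in counts.items() if count > 1}


# Hypothetical toy data: 'App A' appears three times
toy_rows = [
    ['App A', 'SOCIAL'],
    ['App B', 'TOOLS'],
    ['App A', 'SOCIAL'],
    ['App A', 'SOCIAL'],
]

duplicates = find_duplicate_names(toy_rows)
print('# of duplicated app names: {}'.format(len(duplicates)))
print(duplicates)
```

On the real data the same function could be called as `find_duplicate_names(android_dataset)`, provided the assumption about the name column holds.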