<a href="https://www.kaggle.com/code/mayarmohamedswilam/start-ups?scriptVersionId=144111084" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Import necessary libraries such as NumPy, pandas, and seaborn.
List files in the '/kaggle/input/' directory.
Read a CSV file ('companies.csv') into a DataFrame.
Display basic information about the DataFrame using df.info().
Drop unnecessary columns, fill missing values with 0, and convert date columns to datetime format.
Remove duplicate rows from the DataFrame.
Drop columns with a high percentage of missing values.
Drop specific columns ('normalized_name', 'entity_id', 'short_description').
Filter data to select only operating companies with funding greater than 100,000 USD.
Group the data by 'category_code' and calculate the mean of 'investment_rounds' within each category.
Compute a new column 'total_funding' for each company by multiplying 'funding_total_usd' and 'funding_rounds'.
Sort the DataFrame by 'total_funding' in descending order.
Count the number of companies in each category using 'value_counts()'.
Calculate the average funding received by companies with a 'status' of 'acquired'.
Calculate the age of companies based on their 'founded_at' date and the current date.
Group the data by 'status' and calculate the mean of 'milestones' and 'relationships'.
Create a pivot table to analyze the mean 'investment_rounds' for each 'category_code' within different 'status' categories.

**Data cleaning**

In [None]:
import pandas as pd

# Absolute path to the CSV file in the read-only Kaggle input directory
csv_file_path = '/kaggle/input/ofhddd/companies.csv'

# Create a DataFrame from the CSV file
df = pd.read_csv(csv_file_path)
print(df.head())  # Display the first few rows of the DataFrame

In [None]:
df.info()

In [None]:
df.shape

**Drop unnecessary columns**

In [None]:
# Drop unnecessary columns
columns_to_drop = ['Unnamed: 0.1', 'permalink']
df.drop(columns=columns_to_drop, inplace=True)

**For demonstration purposes, we'll fill missing values with 0**

In [None]:
# For demonstration purposes, fill every missing value with 0.
# Note: fillna(0) applies to all columns, so text and date columns also receive 0,
# not only the numerical ones.
df.fillna(0, inplace=True)
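
If the goal is to touch only numerical columns, one option is to select them by dtype first. The following is a minimal sketch on a made-up toy DataFrame (the rows and the name `toy` are illustrative assumptions, not part of the companies dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the companies data (values are made up)
toy = pd.DataFrame({
    "name": ["Acme", None, "Initech"],
    "funding_total_usd": [250000.0, np.nan, 90000.0],
    "milestones": [2.0, 1.0, np.nan],
})

# Fill only the numeric columns; text columns keep their missing values
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(0)

print(toy)
```

With this approach the missing `name` stays missing, which keeps later missing-value checks meaningful for text and date columns.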

In [None]:
# Convert date columns to datetime format
date_columns = ['first_milestone_at', 'last_milestone_at', 'created_at', 'updated_at']
for col in date_columns:
    df[col] = pd.to_datetime(df[col], errors='coerce')  # Convert invalid dates to NaT
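
As a small illustration of what `errors='coerce'` does, the sketch below parses a hypothetical Series of strings (the values are made up for this example). Entries that cannot be parsed become `NaT` instead of raising an error:

```python
import pandas as pd

# Hypothetical raw values: one valid date, one invalid string, one missing entry
raw = pd.Series(["2012-05-01", "not a date", None])

# Unparseable and missing entries are converted to NaT
parsed = pd.to_datetime(raw, errors="coerce")

print(parsed)
print("Number of NaT values:", parsed.isna().sum())
```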

In [None]:
df.info()

In [None]:
df.head()

In [None]:
missing_percentages = (df.isnull().sum() / len(df)) * 100
print(missing_percentages)

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
threshold = 0.7  # Maximum allowed fraction of NaN values per column (0.7 = 70%)
columns_to_drop = df.columns[df.isnull().mean() > threshold]
df.drop(columns=columns_to_drop, inplace=True)
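
To see how the threshold rule behaves, here is a minimal sketch on a toy DataFrame (column names and values are made up for illustration). `isnull().mean()` gives the fraction of missing values per column, and only columns above the threshold are removed:

```python
import numpy as np
import pandas as pd

# Toy frame with different amounts of missing data per column
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    "half_missing": [np.nan, 2.0, np.nan, 4.0],       # 50% missing
    "complete": [1, 2, 3, 4],                         # 0% missing
})

threshold = 0.7
missing_fraction = toy.isnull().mean()  # fraction of NaN values per column

# Keep only the columns whose missing fraction does not exceed the threshold
kept = toy.loc[:, missing_fraction <= threshold]

print(missing_fraction)
print(list(kept.columns))
```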

In [None]:
columns_to_drop = ['normalized_name', 'entity_id', 'short_description']
df.drop(columns=columns_to_drop, inplace=True)

**Data Processing**

**Filtering Data**

In [None]:
filtered_df = df[(df['status'] == 'operating') & (df['funding_total_usd'] > 100000)]
print(filtered_df)

**Grouping Data**

In [None]:
grouped_df = df.groupby('category_code')['investment_rounds'].mean()
print(grouped_df)

**Calculations**

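Calculate a total_funding value for each company by multiplying funding_total_usd by funding_rounds

In [None]:
# Note: funding_total_usd already looks like a per-company total, so this product
# is a derived score rather than a literal sum of money received
df['total_funding'] = df['funding_total_usd'] * df['funding_rounds']
print(df[['name', 'total_funding']])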

**Sorting Data**

In [None]:
sorted_df = df.sort_values(by='total_funding', ascending=False)
print(sorted_df[['name', 'total_funding']])

**Counting Categories**

Count the number of companies in each category

In [None]:
category_counts = df['category_code'].value_counts()
print(category_counts)

**Conditional Calculations**

Calculate the average funding received by companies with a "status" of "acquired"

In [None]:
average_funding_acquired = df[df['status'] == 'acquired']['funding_total_usd'].mean()
print("Average funding for acquired companies:", average_funding_acquired)

**Creating New Columns**

Calculate the age of companies based on their "founded_at" date and the current date:


In [None]:
from datetime import datetime
# Convert 'founded_at' column to datetime format
df['founded_at'] = pd.to_datetime(df['founded_at'], errors='coerce')

# Calculate company age based on the 'founded_at' column
current_year = datetime.now().year
df['company_age'] = current_year - df['founded_at'].dt.year

print(df[['name', 'company_age']])

In [None]:
df[['name', 'company_age']].head()

**Aggregation with Grouping**

In [None]:
aggregated_group = df.groupby('status')[['milestones', 'relationships']].mean()
print(aggregated_group)

**Pivot Table**

In [None]:
pivot_table = df.pivot_table(index='status', columns='category_code', values='investment_rounds', aggfunc='mean')
print(pivot_table)
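
Each cell of this pivot table is the mean of 'investment_rounds' for one combination of status and category. The sketch below shows this on a made-up toy DataFrame (the rows are illustrative assumptions) and checks that the result matches a two-level groupby:

```python
import pandas as pd

# Toy data: five hypothetical companies
toy = pd.DataFrame({
    "status": ["operating", "operating", "acquired", "acquired", "operating"],
    "category_code": ["web", "web", "web", "software", "software"],
    "investment_rounds": [1, 3, 2, 4, 0],
})

# Rows are statuses, columns are categories, cells are mean investment rounds
toy_pivot = toy.pivot_table(index="status", columns="category_code",
                            values="investment_rounds", aggfunc="mean")
print(toy_pivot)

# The same numbers, computed with a two-level groupby
grouped = toy.groupby(["status", "category_code"])["investment_rounds"].mean()
print(grouped)
```

Because the cells hold means, reading a value such as `toy_pivot.loc['operating', 'web']` answers "how many investment rounds on average", not "how many companies".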

**Data Labeling**

**the average value of the "software" category for companies that went public (IPO)**

In [None]:
avg_software_for_ipo = pivot_table.loc['ipo', 'software']
print("Average value of the 'software' category for IPO companies:", avg_software_for_ipo)

**Which category has the lowest average value for companies that are acquired**

In [None]:
lowest_avg_category_acquired = pivot_table.loc['acquired'].idxmin()
lowest_avg_value_acquired = pivot_table.loc['acquired'].min()
print("Category with the lowest average value for acquired companies:", lowest_avg_category_acquired)
print("Lowest average value for acquired companies:", lowest_avg_value_acquired)

**the total average value of all categories for companies that are operating**

In [None]:
total_avg_value_operating = pivot_table.loc['operating'].mean()
print("Total average value of all categories for operating companies:", total_avg_value_operating)

**Which category has the highest variation (standard deviation) of average values across different statuses?**

In [None]:
highest_variation_category = pivot_table.std().idxmax()
highest_variation_value = pivot_table.std().max()
print("Category with the highest variation of average values:", highest_variation_category)
print("Highest variation value:", highest_variation_value)

**For IPO companies, which category has the highest average value per company (excluding NaN values)**

In [None]:
# The pivot values are already per-company means, so no further division is needed;
# idxmax() and max() skip NaN entries by default
ipo_row = pivot_table.loc['ipo']
highest_avg_per_company_category_ipo = ipo_row.idxmax()
highest_avg_per_company_value_ipo = ipo_row.max()
print("Category with the highest average value per company for IPO companies:", highest_avg_per_company_category_ipo)
print("Highest average value per company for IPO companies:", highest_avg_per_company_value_ipo)

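**the most common status among companies in the "web" category**

In [None]:
# The pivot table holds means, not counts, so count statuses directly from df
most_common_status_web = df.loc[df['category_code'] == 'web', 'status'].value_counts().idxmax()
print("Most common status among companies in the 'web' category:", most_common_status_web)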

**the highest average value for companies that are closed**

In [None]:
highest_avg_value_closed_category = pivot_table.loc['closed'].idxmax()
highest_avg_value_closed = pivot_table.loc['closed'].max()
print("Category with the highest average value for closed companies:", highest_avg_value_closed_category)
print("Highest average value for closed companies:", highest_avg_value_closed)

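**the total number of companies in the "software" category for each status**

In [None]:
# Count companies per status directly from df; summing pivot-table means would not give a count
software_counts_by_status = df.loc[df['category_code'] == 'software', 'status'].value_counts()
print("Number of companies in the 'software' category by status:")
print(software_counts_by_status)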
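
Counting questions like the two above can also be answered from a single table of counts. As a minimal sketch on made-up toy data (the rows are illustrative assumptions), `pd.crosstab` tabulates how many companies fall into each category and status combination:

```python
import pandas as pd

# Toy data: five hypothetical companies
toy = pd.DataFrame({
    "status": ["operating", "operating", "acquired", "acquired", "operating"],
    "category_code": ["web", "web", "web", "software", "software"],
})

# Rows are categories, columns are statuses, cells are company counts
counts = pd.crosstab(toy["category_code"], toy["status"])
print(counts)

# Most common status within one category, and the total for another
most_common_web = counts.loc["web"].idxmax()
total_software = counts.loc["software"].sum()
print(most_common_web, total_software)
```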

**the category with the highest average value for companies that are acquired**

In [None]:
# pivot_table holds mean investment_rounds (not milestones) per status and category
highest_avg_category_acquired = pivot_table.loc['acquired'].idxmax()
highest_avg_value_acquired = pivot_table.loc['acquired'].max()
print("Category with the highest average value for 'acquired' companies:", highest_avg_category_acquired)
print("Highest average value for 'acquired' companies:", highest_avg_value_acquired)

**the average value of the "semiconductor" category for companies that are operating**

In [None]:
avg_semiconductor_operating = pivot_table.loc['operating', 'semiconductor']
print("Average value of the 'semiconductor' category for operating companies:", avg_semiconductor_operating)

**the average value of the "advertising" category for companies that are operating**

In [None]:
avg_advertising_for_operating = pivot_table.loc['operating', 'advertising']
print("Average value of the 'advertising' category for operating companies:", avg_advertising_for_operating)

**the highest average value for companies that went public (IPO)**

In [None]:
highest_avg_category_ipo = pivot_table.loc['ipo'].idxmax()
highest_avg_value_ipo = pivot_table.loc['ipo'].max()
print("Category with the highest average value for IPO companies:", highest_avg_category_ipo)
print("Highest average value for IPO companies:", highest_avg_value_ipo)

**the average value of the "analytics" category for companies that are closed**

In [None]:
avg_analytics_for_closed = pivot_table.loc['closed', 'analytics']
print("Average value of the 'analytics' category for closed companies:", avg_analytics_for_closed)

**For companies that are operating, what is the category with the second-highest average value**

In [None]:
second_highest_avg_category_operating = pivot_table.loc['operating'].nlargest(2).idxmin()
second_highest_avg_value_operating = pivot_table.loc['operating'].nlargest(2).min()
print("Category with the second-highest average value for operating companies:", second_highest_avg_category_operating)
print("Second-highest average value for operating companies:", second_highest_avg_value_operating)
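
The second-highest lookup relies on `nlargest(2)` returning the top two values in descending order while ignoring NaN. A minimal sketch on a hypothetical row of category means (the numbers are made up):

```python
import pandas as pd

# Hypothetical row of per-category means, including one missing value
row = pd.Series({"web": 1.5, "software": 2.5, "mobile": 2.0, "biotech": float("nan")})

# nlargest(2) keeps the two largest non-NaN values, largest first
top_two = row.nlargest(2)

# The last entry of the top two is the second-highest value
second_category = top_two.index[-1]
second_value = top_two.iloc[-1]

print(second_category, second_value)
```

Taking the last element of `top_two` gives the same result as `idxmin()` / `min()` on it, and reads slightly more directly.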

**the average value of the "ecommerce" category for companies that are acquired**

In [None]:
avg_ecommerce_for_acquired = pivot_table.loc['acquired', 'ecommerce']
print("Average value of the 'ecommerce' category for acquired companies:", avg_ecommerce_for_acquired)

**data visualization**

In [None]:
import matplotlib.pyplot as plt

In [None]:
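# DataFrame.plot creates its own figure, so pass figsize here instead of calling plt.figure()
pivot_table.T.plot(kind='bar', stacked=True, cmap='Set3', figsize=(12, 6))
plt.title("Mean Investment Rounds by Category, Stacked by Startup Status")
plt.xlabel("Category Code")
plt.ylabel("Mean Investment Rounds (stacked)")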
plt.xticks(rotation=45)
plt.legend(title="Startup Status")
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
sns.violinplot(data=pivot_table, palette="Set3")
plt.title("Distribution of Category Values by Startup Status")
plt.xlabel("Category Code")
plt.ylabel("Value")
plt.xticks(rotation=45)
plt.show()

In [None]:
# DataFrame.plot creates its own figure, so pass figsize here instead of calling plt.figure()
pivot_table.plot(kind='bar', figsize=(10, 6))
plt.title("Average Values of Categories by Startup Status")
plt.xlabel("Startup Status")
plt.ylabel("Average Value")
plt.xticks(rotation=0)
plt.legend(title="Category Code")
plt.show()

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))
categories = pivot_table.columns
statuses = pivot_table.index

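# Offset each category's bars so they sit side by side instead of overplotting each other
x = np.arange(len(statuses))
width = 0.8 / len(categories)
for i, category in enumerate(categories):
    ax.bar(x + i * width, pivot_table[category], width=width, label=category)
ax.set_xticks(x + width * (len(categories) - 1) / 2)
ax.set_xticklabels(statuses)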

ax.set_title("Average Values of Categories by Startup Status")
ax.set_xlabel("Startup Status")
ax.set_ylabel("Average Value")
ax.legend(title="Category Code")
plt.xticks(rotation=0)
plt.tight_layout()

plt.show()

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))
categories = pivot_table.columns
statuses = pivot_table.index

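# Offset each category's bars so they sit side by side instead of overplotting each other
y = np.arange(len(statuses))
height = 0.8 / len(categories)
for i, category in enumerate(categories):
    ax.barh(y + i * height, pivot_table[category], height=height, label=category)
ax.set_yticks(y + height * (len(categories) - 1) / 2)
ax.set_yticklabels(statuses)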

ax.set_title("Average Values of Categories by Startup Status")
ax.set_xlabel("Average Value")
ax.set_ylabel("Startup Status")
ax.legend(title="Category Code")
plt.tight_layout()

plt.show()

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))
statuses = pivot_table.index
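# Exclude the 0 column (created by fillna(0)) by value rather than by position
categories = [c for c in pivot_table.columns if c != 0]

for category in categories:
    # alpha keeps overlapping areas visible
    ax.fill_between(statuses, pivot_table[category], alpha=0.3, label=category)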

ax.set_title("Area Plot of Categories by Startup Status")
ax.set_xlabel("Startup Status")
ax.set_ylabel("Value")
ax.legend(title="Category Code")
plt.tight_layout()

plt.show()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Group by category_code and calculate average funding
average_funding_by_category = df.groupby('category_code')['funding_total_usd'].mean().reset_index()

# Set style
sns.set(style="whitegrid")

# Create bar plot
plt.figure(figsize=(12, 6))
sns.barplot(data=average_funding_by_category, x='category_code', y='funding_total_usd')
plt.xticks(rotation=45, ha='right')
plt.xlabel('Category Code')
plt.ylabel('Average Funding (USD)')
plt.title('Average Funding by Category Code')
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='status')
plt.xlabel('Status')
plt.ylabel('Count')
plt.title('Count of Companies by Status')
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='funding_total_usd', y='milestones', hue='status')
plt.xlabel('Funding (USD)')
plt.ylabel('Milestones')
plt.title('Funding vs. Milestones')
plt.legend(title='Status')
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(8, 6))
sns.boxplot(data=df, x='status', y='funding_total_usd')
plt.xlabel('Status')
plt.ylabel('Funding (USD)')
plt.title('Funding Distribution by Status')
plt.tight_layout()
plt.show()

In [None]:
# Convert 'first_funding_at' to datetime format
df['first_funding_at'] = pd.to_datetime(df['first_funding_at'], errors='coerce')

# Extract year from 'first_funding_at'
df['funding_year'] = df['first_funding_at'].dt.year

# Calculate average funding per year
average_funding_by_year = df.groupby('funding_year')['funding_total_usd'].mean().reset_index()

# Create line plot
plt.figure(figsize=(10, 6))
sns.lineplot(data=average_funding_by_year, x='funding_year', y='funding_total_usd')
plt.xlabel('Year')
plt.ylabel('Average Funding (USD)')
plt.title('Average Funding Over Time')
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='funding_total_usd', bins=30, kde=True)
plt.xlabel('Funding Total (USD)')
plt.ylabel('Frequency')
plt.title('Distribution of Funding Total')
plt.tight_layout()
plt.show()

In [None]:
# Select relevant numeric columns
numeric_columns = ['funding_total_usd', 'milestones', 'investment_rounds', 'relationships']

# Create pair plot
sns.pairplot(data=df[numeric_columns])
plt.tight_layout()
plt.show()

In [None]:
# Calculate correlation matrix
correlation_matrix = df[numeric_columns].corr()

# Create heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.tight_layout()
plt.show()

**PCA**

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the dataset
csv_file_path = '/kaggle/input/ofhddd/companies.csv'
df = pd.read_csv(csv_file_path)

# Drop unnecessary columns
columns_to_drop = ['Unnamed: 0.1', 'permalink']
df.drop(columns=columns_to_drop, inplace=True)

# Fill all missing values with 0 (applies to every column, not only numerical ones)
df.fillna(0, inplace=True)

# Convert date columns to datetime format
date_columns = ['first_milestone_at', 'last_milestone_at', 'created_at', 'updated_at']
for col in date_columns:
    df[col] = pd.to_datetime(df[col], errors='coerce')

# Filter data
filtered_df = df[(df['status'] == 'operating') & (df['funding_total_usd'] > 100000)]

# Perform PCA on selected numerical features
numerical_columns = ['investment_rounds', 'invested_companies', 'funding_rounds', 'funding_total_usd', 'milestones', 'relationships', 'lat', 'lng', 'ROI']
scaler = StandardScaler()
df[numerical_columns] = scaler.fit_transform(df[numerical_columns])

n_components = 5  # Choose the number of components
pca = PCA(n_components=n_components)
pca_result = pca.fit_transform(df[numerical_columns])

explained_variance = pca.explained_variance_ratio_
print("Explained Variance Ratios:", explained_variance)

# Feature engineering examples
df_encoded = pd.get_dummies(df, columns=['category_code'])  # One-hot encoding of categorical column
df['new_feature'] = df['investment_rounds'] * df['funding_rounds']  # Creating a new feature by interaction

# Extracting time-based features
df['month'] = df['created_at'].dt.month
df['day_of_week'] = df['created_at'].dt.dayofweek
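
One way to choose the number of components, instead of fixing it at 5, is to look at the cumulative explained variance. The sketch below uses randomly generated data (an assumption for illustration, not the companies dataset) and a 90% target, which is also just an example value:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 200 rows, 6 features, two of them strongly correlated
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[:, 1] = X[:, 0] * 2 + rng.normal(scale=0.1, size=200)

# Standardize, then fit PCA with all components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

# Cumulative share of variance explained by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 90%
n_for_90 = int(np.searchsorted(cumulative, 0.90) + 1)

print("Cumulative explained variance:", cumulative)
print("Components needed for 90%:", n_for_90)
```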

**Model**

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Load a sample dataset (Iris dataset for classification)
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Feature Selection: Using SelectKBest with Chi-Square test
k_best = SelectKBest(score_func=chi2, k=2)  # Select the top 2 features
X_new = k_best.fit_transform(X_train, y_train)

# Feature Selection: Using Recursive Feature Elimination (RFE)
model = LogisticRegression(max_iter=1000)  # Example model
rfe = RFE(model, n_features_to_select=2)  # Select the top 2 features
X_rfe = rfe.fit_transform(X_train, y_train)

# Feature Selection: Using Tree-based Feature Importance
rf_classifier = RandomForestClassifier()
rf_classifier.fit(X_train, y_train)
feature_importances = rf_classifier.feature_importances_

# Dimensionality reduction: Principal Component Analysis (PCA) builds new components rather than selecting original features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
pca = PCA(n_components=2)  # Reduce to 2 components
X_pca = pca.fit_transform(X_scaled)

# Print selected features
print("SelectKBest (Chi-Square):")
print(X_new[:5])  # Display the first 5 rows of selected features

print("\nRFE (Recursive Feature Elimination):")
print(X_rfe[:5])  # Display the first 5 rows of selected features

print("\nRandom Forest Feature Importances:")
print(feature_importances)

print("\nPCA (Principal Component Analysis):")
print(X_pca[:5])  # Display the first 5 rows of PCA-transformed features

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Visualize SelectKBest (Chi-Square) Results
sns.scatterplot(x=X_new[:, 0], y=X_new[:, 1], hue=y_train, palette='viridis')
plt.title('SelectKBest (Chi-Square) - Top 2 Features')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

# Visualize RFE (Recursive Feature Elimination) Results
sns.scatterplot(x=X_rfe[:, 0], y=X_rfe[:, 1], hue=y_train, palette='viridis')
plt.title('RFE (Recursive Feature Elimination) - Top 2 Features')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

# Visualize Random Forest Feature Importances
feature_names = X_train.columns
plt.figure(figsize=(10, 6))
sns.barplot(x=feature_importances, y=feature_names)
plt.title('Random Forest Feature Importances')
plt.xlabel('Feature Importance')
plt.ylabel('Feature Name')
plt.show()

# Visualize PCA (Principal Component Analysis) Results
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=y_train, palette='viridis')
plt.title('PCA (Principal Component Analysis) - 2 Components')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Generate sample data
np.random.seed(42)
data = {
    'Age': np.random.randint(18, 65, 100),  # Sample ages from 18 to 64 (the upper bound 65 is exclusive)
    'Income': np.random.normal(50000, 15000, 100),  # Sample income with mean 50,000 and std dev 15,000
    'Gender': np.random.choice(['Male', 'Female'], 100),  # Sample gender
}

df = pd.DataFrame(data)

# Basic Visualizations

# 1. Histogram
plt.figure(figsize=(8, 4))
plt.hist(df['Age'], bins=10, edgecolor='k', alpha=0.7)
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution')
plt.show()

# 2. Box Plot
plt.figure(figsize=(8, 4))
sns.boxplot(x='Gender', y='Income', data=df)
plt.xlabel('Gender')
plt.ylabel('Income')
plt.title('Income by Gender')
plt.show()

# 3. Scatter Plot
plt.figure(figsize=(8, 4))
sns.scatterplot(x='Age', y='Income', data=df, hue='Gender')
plt.xlabel('Age')
plt.ylabel('Income')
plt.title('Scatter Plot of Age vs. Income')
plt.show()

# 4. Pair Plot (for exploring relationships between multiple variables)
sns.pairplot(df, hue='Gender')
plt.show()

# 5. Correlation Heatmap (for exploring correlations between numerical variables)
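# numeric_only=True skips the text column 'Gender', which would otherwise raise an error in pandas 2.x
correlation_matrix = df.corr(numeric_only=True)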
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()

Importing Libraries: The code begins by importing the necessary Python libraries: NumPy and pandas for data manipulation, Matplotlib for visualization, and scikit-learn for the machine learning tasks.

Loading Iris Dataset: The Iris dataset, a popular dataset for classification, is loaded using load_iris from sklearn.datasets. It contains four measurements of iris flowers together with their species.

Data Preparation: The dataset is split into features (X) and target labels (y). The data is then divided into training (70%) and testing (30%) sets using train_test_split.

Feature Selection using SelectKBest with Chi-Square Test:

SelectKBest from sklearn.feature_selection is used to select the top 2 features based on the chi-square statistical test. The selected features are stored in X_new.

Feature Selection using Recursive Feature Elimination (RFE):

RFE is applied to a logistic regression model to select the top 2 features. The selected features are stored in X_rfe.

Feature Selection using Tree-based Feature Importance (Random Forest):

A Random Forest classifier is trained on the training data (X_train and y_train), and feature importances are calculated. These importances are stored in feature_importances.

Feature Ordering:

feature_importances_ is already returned in the order of X_train.columns. The list ordered_feature_importances looks up each importance by its column position, so it reproduces that order unchanged; it is kept only to make the pairing between importances and feature names explicit.

Principal Component Analysis (PCA):

Principal Component Analysis is applied to the standardized training data (X_scaled) to reduce it to two principal components. The transformed data is stored in X_pca.

Data Visualization - Scatter Plot of PCA Components:

A scatter plot is created to visualize the first two principal components (X_pca[:, 0] and X_pca[:, 1]). Data points are colored based on their corresponding class labels.

Data Visualization - Bar Plot of Feature Importances:

A bar plot is created to visualize the feature importances obtained from the Random Forest classifier. The x-axis represents feature names, and the y-axis represents feature importances.

Print Selected Features:

The code prints the first five rows of the features selected by SelectKBest and RFE.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Load a sample dataset (Iris dataset for classification)
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Feature Selection: Using SelectKBest with Chi-Square test
k_best = SelectKBest(score_func=chi2, k=2)  # Select the top 2 features
X_new = k_best.fit_transform(X_train, y_train)

# Feature Selection: Using Recursive Feature Elimination (RFE)
model = LogisticRegression(max_iter=1000)  # Example model
rfe = RFE(model, n_features_to_select=2)  # Select the top 2 features
X_rfe = rfe.fit_transform(X_train, y_train)

# Feature Selection: Using Tree-based Feature Importance
rf_classifier = RandomForestClassifier()
rf_classifier.fit(X_train, y_train)
feature_importances = rf_classifier.feature_importances_

# feature_importances_ is already in the order of X_train.columns; this lookup keeps that pairing explicit
ordered_feature_importances = [feature_importances[X_train.columns.get_loc(feature)] for feature in X_train.columns]

# Dimensionality reduction: Principal Component Analysis (PCA) builds new components rather than selecting original features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
pca = PCA(n_components=2)  # Reduce to 2 components
X_pca = pca.fit_transform(X_scaled)

# Data Visualization: Scatter Plot of PCA Components
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_train, cmap=plt.cm.Set1)
plt.title("PCA of Iris Dataset")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.show()

# Data Visualization: Bar Plot of Feature Importances
plt.figure(figsize=(8, 6))
plt.bar(X_train.columns, ordered_feature_importances)
plt.title("Feature Importances (Random Forest)")
plt.xlabel("Features")
plt.ylabel("Importance")
plt.xticks(rotation=45)
plt.show()

# Print selected features
print("SelectKBest (Chi-Square):")
print(X_new[:5])  # Display the first 5 rows of selected features

print("\nRFE (Recursive Feature Elimination):")
print(X_rfe[:5])  # Display the first 5 rows of selected features
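
The printed arrays show values but not which columns were chosen. As a minimal sketch, `get_support()` on a fitted selector returns a boolean mask that can be mapped back to feature names. For brevity this example fits on the full Iris data rather than the training split, which is a simplification made here and not part of the cell above:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.linear_model import LogisticRegression

# Load Iris as a DataFrame so the column names are available
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Fit both selectors and map their boolean masks back to column names
k_best = SelectKBest(score_func=chi2, k=2).fit(X, y)
kbest_names = X.columns[k_best.get_support()].tolist()

rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
rfe_names = X.columns[rfe.get_support()].tolist()

print("SelectKBest kept:", kbest_names)
print("RFE kept:", rfe_names)
```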