# Report: Predicting Bulk Modulus


## 1. Data View

In [None]:
from matminer.datasets.convenience_loaders import load_elastic_tensor
df = load_elastic_tensor()  # loads dataset in a pandas DataFrame object
unwanted_columns = ["volume", "nsites", "compliance_tensor", "elastic_tensor", 
                    "elastic_tensor_original", "K_Voigt", "G_Voigt", "K_Reuss", "G_Reuss"]
df = df.drop(unwanted_columns, axis=1)
df.describe()

**Calling `describe()` gives a quick preview (summary statistics) of the data currently in the table.**

## 2. Data Featurization

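**Featurization converts a material's abstract attributes (such as composition and crystal structure) into a set of numerical features that a model can understand and process.**  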

**StrToComposition** parses each chemical formula string into a `composition` column that the downstream featurizers can read.  
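
To make the idea of turning a formula string into a composition concrete, here is a minimal sketch in plain Python. It is not how matminer or pymatgen implement it: `parse_formula` and `atomic_fractions` are hypothetical helper names, and the regular expression only handles flat formulas without parentheses or charges.

```python
import re

def parse_formula(formula):
    """Split a flat formula such as 'Fe2O3' into {element: amount}.

    Illustrative only: no parentheses, hydrates, or charges are supported.
    """
    amounts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        # A missing number means one atom of that element.
        amounts[element] = amounts.get(element, 0.0) + (float(number) if number else 1.0)
    return amounts

def atomic_fractions(amounts):
    """Normalize element amounts so that they sum to 1."""
    total = sum(amounts.values())
    return {element: amount / total for element, amount in amounts.items()}

fe2o3 = parse_formula("Fe2O3")
fe2o3_fractions = atomic_fractions(fe2o3)
acetic_acid = parse_formula("CH3COOH")
```

The atomic fractions are what composition-based featurizers typically weight elemental properties by.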
  
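**ElementProperty (Magpie preset)** is a classic feature set in materials informatics. It contains about 145 descriptors built from basic properties of the elements, for example: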
   
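>  Atomic mass (atomic_mass) | electronegativity (X) | atomic radius (atom_radius) | melting point (melting_point) | number of valence electrons (valence_electrons) |  
> together with statistics of these properties over the compound (mean, variance, maximum, minimum, range, etc.)  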
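
The statistics mentioned above can be sketched for a single property as follows. This is an illustration under stated assumptions, not matminer's implementation: `composition_stats` is a hypothetical helper, the electronegativities are Pauling values for Fe and O, and only a few statistics are shown.

```python
def composition_stats(fractions, prop):
    """Composition-weighted statistics of one elemental property.

    fractions: {element: atomic fraction}, summing to 1
    prop:      {element: property value}
    """
    values = [prop[element] for element in fractions]
    mean = sum(frac * prop[element] for element, frac in fractions.items())
    # Average absolute deviation from the weighted mean.
    avg_dev = sum(frac * abs(prop[element] - mean) for element, frac in fractions.items())
    return {
        "minimum": min(values),
        "maximum": max(values),
        "range": max(values) - min(values),
        "mean": mean,
        "avg_dev": avg_dev,
    }

# Fe2O3: 2 Fe + 3 O, so the atomic fractions are 0.4 and 0.6.
fractions = {"Fe": 0.4, "O": 0.6}
pauling_x = {"Fe": 1.83, "O": 3.44}  # Pauling electronegativities
stats = composition_stats(fractions, pauling_x)
```

Repeating this for every elemental property and every statistic is what produces a wide, purely numerical feature table from a single composition column.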

**CompositionToOxidComposition** infers the likely oxidation state of each element from the elements present in the chemical formula.

**OxidationStates** computes statistical features from the inferred oxidation states.
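
The two steps above can be illustrated with a toy version based only on charge neutrality. This is a simplified sketch, not the method used by the library: the candidate oxidation states are supplied by hand, and `guess_oxidation_states` and `oxidation_summary` are hypothetical names.

```python
from itertools import product

def guess_oxidation_states(counts, candidates):
    """Return every charge-neutral assignment {element: oxidation state}.

    counts:     {element: number of atoms in the formula}
    candidates: {element: allowed oxidation states} (hand-picked assumption)
    """
    elements = list(counts)
    neutral = []
    for states in product(*(candidates[element] for element in elements)):
        total_charge = sum(counts[element] * state for element, state in zip(elements, states))
        if total_charge == 0:
            neutral.append(dict(zip(elements, states)))
    return neutral

def oxidation_summary(states):
    """Simple statistics over the oxidation states of the species."""
    values = list(states.values())
    return {"minimum": min(values), "maximum": max(values), "range": max(values) - min(values)}

# Fe2O3 with Fe allowed to be +2 or +3 and O fixed at -2.
neutral = guess_oxidation_states({"Fe": 2, "O": 3}, {"Fe": (2, 3), "O": (-2,)})
summary = oxidation_summary(neutral[0])
```

Only Fe(+3) balances three O(-2), so a single assignment survives. Real compounds can have several neutral solutions or mixed valence, which this sketch does not rank or resolve.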

**DensityFeatures** computes density-related geometric features from the crystal structure of each material.
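
As a rough illustration of what density-related features look like, the sketch below computes the mass density and the volume per atom of a unit cell by hand. It is an assumption-laden example rather than the featurizer's code: `density_features` is a hypothetical helper, the NaCl lattice parameter is taken as roughly 5.64 Å, and the packing fraction is left out because it needs atomic radii.

```python
# 1 amu per cubic angstrom expressed in g/cm^3.
AMU_PER_A3_TO_G_PER_CM3 = 1.66053907

def density_features(cell_counts, atomic_mass, cell_volume):
    """Density (g/cm^3) and volume per atom (A^3) of one unit cell.

    cell_counts: {element: atoms in the cell}
    atomic_mass: {element: mass in amu}
    cell_volume: cell volume in cubic angstroms
    """
    n_atoms = sum(cell_counts.values())
    mass = sum(n * atomic_mass[element] for element, n in cell_counts.items())
    return {
        "density": mass / cell_volume * AMU_PER_A3_TO_G_PER_CM3,
        "vpa": cell_volume / n_atoms,
    }

# Conventional rock-salt cell: 4 Na + 4 Cl in a cube of edge ~5.64 A (approximate).
a = 5.64
nacl = density_features({"Na": 4, "Cl": 4}, {"Na": 22.990, "Cl": 35.45}, a ** 3)
```

The result is close to the commonly quoted density of NaCl, about 2.16 g/cm^3, which is a quick sanity check on the unit conversion.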


In [None]:
from matminer.featurizers.conversions import StrToComposition
df = StrToComposition().featurize_dataframe(df, "formula")
df.head()

In [None]:
from matminer.featurizers.composition import ElementProperty

ep_feat = ElementProperty.from_preset(preset_name="magpie")
df = ep_feat.featurize_dataframe(df, col_id="composition")  # input the "composition" column to the featurizer
df.head()

In [None]:
from matminer.featurizers.conversions import CompositionToOxidComposition
from matminer.featurizers.composition import OxidationStates

df = CompositionToOxidComposition().featurize_dataframe(df, "composition")

os_feat = OxidationStates()
df = os_feat.featurize_dataframe(df, "composition_oxid")
df.head()

In [None]:
from matminer.featurizers.structure import DensityFeatures

df_feat = DensityFeatures()
df = df_feat.featurize_dataframe(df, "structure")  # input the structure column to the featurizer
df.head()
df_feat.feature_labels()

## 3. Machine Learning

### 3.1 Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

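# Assumption: the cell that defines X and y is not shown in this report.
# Here the target is taken to be the VRH-averaged bulk modulus ("K_VRH"),
# and all non-feature columns are dropped from the inputs.
y = df["K_VRH"].values
excluded = ["G_VRH", "K_VRH", "elastic_anisotropy", "formula", "material_id",
            "poisson_ratio", "structure", "composition", "composition_oxid"]
X = df.drop(excluded, axis=1)

lr = LinearRegression()

lr.fit(X, y)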

# get fit statistics
print('training R2 = ' + str(round(lr.score(X, y), 3)))
print('training RMSE = %.3f' % np.sqrt(mean_squared_error(y_true=y, y_pred=lr.predict(X))))
# R2 = 0.927
# RMSE = 19.625
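
For reference, the two fit statistics printed above can be written out by hand. The sketch below uses made-up numbers, not the dataset, and the function names `rmse` and `r2` are illustrative; it assumes the standard definitions of root-mean-square error and the coefficient of determination.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error: typical size of a prediction error, in the units of y."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 minus residual variance over total variance."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy values (hypothetical), each prediction off by exactly 10.
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 190.0, 260.0]
toy_rmse = rmse(y_true, y_pred)
toy_r2 = r2(y_true, y_pred)
```

RMSE is expressed in the same units as the target, while R2 is dimensionless, which is why the two are usually reported together.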

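Linear regression is a very simple model, and its error is fairly large. The scores above are also computed on the same data the model was trained on, so cross-validation is added on top of the previous step to get a more honest estimate.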

In [None]:
from sklearn.model_selection import KFold, cross_val_score

# Use 10-fold cross validation (90% training, 10% test)
crossvalidation = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(lr, X, y, scoring='neg_mean_squared_error', cv=crossvalidation, n_jobs=1)
rmse_scores = [np.sqrt(abs(s)) for s in scores]
r2_scores = cross_val_score(lr, X, y, scoring='r2', cv=crossvalidation, n_jobs=1)

print('Cross-validation results:')
# R2 can be negative for a poor model, so average the raw scores instead of their absolute values
print('Folds: %i, mean R2: %.3f' % (len(scores), np.mean(r2_scores)))
print('Folds: %i, mean RMSE: %.3f' % (len(scores), np.mean(rmse_scores)))

# Cross-validation results:
# Folds: 10, mean R2: 0.902
# Folds: 10, mean RMSE: 22.467

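After cross-validation, R2 drops from 0.927 to 0.902 and RMSE rises from 19.625 to 22.467. In each fold, a random 10% of the data is held out from training and used only for testing, so the model is scored on samples it has never seen and may not fit them as well. A lower overall score is therefore expected. The change in R2 is small, however, which suggests that the linear model fits reasonably well and is not severely overfitting.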
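
The hold-out logic described above can be sketched without any library. This is a simplified illustration of shuffled k-fold splitting, not scikit-learn's `KFold` code: `kfold_indices` is a hypothetical helper, and its shuffle will not reproduce the same folds as `random_state=1` in scikit-learn.

```python
import random

def kfold_indices(n_samples, n_splits, seed=1):
    """Yield (train, test) index lists for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(kfold_indices(n_samples=20, n_splits=10))
held_out = sorted(i for _, test in folds for i in test)
```

Every sample is held out exactly once, so each prediction that enters the cross-validated score comes from a model that did not see that sample during training.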

### 3.2 Random Forest

### 3.3 